Background and Overview

DataCamp offer interactive courses related to R Programming. While some is review, it is helpful to see other perspectives on material. As well, DataCamp has some interesting materials on packages that I want to learn better (ggplot2, dplyr, ggvis, etc.). This document summarizes a few key insights from:

This document is currently split between _v003 and _v003_a and _v003_b due to the need to keep the number of DLL that it opens below the hard-coded maximum. This introductory section needs to be re-written, and the contents consolidated, at a future date.

The original DataCamp_Insights_v001 and DataCamp_Insights_v002 documents have been split for this document:


Spatial Statistics in R

Chapter 1 - Introduction

Problems in spatial statistics:

  • Epidemics, susceptibility, locations, etc.
  • Divisions of a lager area - healt disricts, counties, etc.
  • Geostatistical data is the availability of data that has a spatial component
  • At school we were taught to make the most of a piece of graph paper by scaling our data to fit the page
    • R will usually follow this advice by making a plot fill the graphics window
  • With spatial data, this can cause misleading distortion that changes the distance and direction between pairs of points
    • The data in the previous exercise was created in a tall, skinny rectangle, and it should always be shown as a tall, skinny rectangle
    • If R stretches this to fill a wide graphics window then it is misrepresenting the relationship between events in the up-down and left-right directions
  • So spatial plots should have scales so that one unit in the X axis is the same size as one unit on the Y axis
    • Circles will appear as circles and not ellipses, and squares will appear square
    • The ratio of the Y axis scale to the X axis scale is called the aspect ratio of the plot
    • Spatial data should always be presented with an aspect ratio of 1:1

Simulation and testing with spatstat:

  • A “point” is defined to be any specific (x, y) location on the 2D plane
  • An “event” is a key data point; in the literature, a point is just a location while an event is an observation or a specific point of interest
  • The “window” is the defined study area, and events outside the window are unobserved
  • A “spatial point pattern” is a set of events inside a defined window
  • A “spatial point process” is a stochastic process (RNG) for events inside a defined window
  • The spatstat library stores spatial objects inside the ppp library
    • Coordinates, window, marks, etc.
    • Defaults to a unit square for the window
    • Can plot(), print(), summary(), etc.
  • The most basic type of spatial plot is “complete spatial randomness” (csr), where no part of the window differs from the others
    • In the quadrat test, the window is sub-divided in to squares, and counts are taken within each of the squares
    • The expected distribution would be the Poisson distribution, and the assessment of fit can be made using Chi-squared for counts by bucket
    • Quadrat count tests are implemented using quadrat.test(), which takes a planar point pattern, ppp() object
  • A Poisson point process creates events according to a Poisson distribution with an intensity parameter specifying the expected events per unit area
    • The total number of events generated is a single number from a Poisson distribution, so multiple realisations of the same process can easily have different numbers of events

Further testing:

  • The quadrat test depends on selecting the right sub-window size, otherwise the test can lose power due to homogeneity
  • One alternative test is the “nearest neighbors” test - find the “nearest neighbor” for each event, calculate the distance, and plot the distribution
    • Can compare the ecdf with theoretical, accounting for edge-effect adjustments to theoretical, since events near the edge are deprived of potential close neighbors outside the window
  • Another alternative test is the Ripley’s K Function
    • Count the number of events within a circle of diameter d from a specificed event
    • Plot the resulting function and compare with theoretical (~ pi * d**2)
    • Can calculate p-values based on comparisons to random simulation
    • If the observed data are greater than random CI at any specific value for d, that suggests clustering at around that distance
  • Spatial statistics frequently uses Monte Carlo simulation to calculate CI and evaluate hypothesis tests

Example code includes:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

# The number of points to create
n <- 200

# Set the range
xmin <- 0
xmax <- 1
ymin <- 0
ymax <- 2

# Sample from a Uniform distribution
x <- runif(n, xmin, xmax)
y <- runif(n, ymin, ymax)


# The ratio of the Y axis scale to the X axis scale is called the aspect ratio of the plot. Spatial data should always be presented with an aspect ratio of 1:1.
# See pre-defined variables
# ls.str()

# Plot points and a rectangle

mapxy <- function(a = NA){
  plot(x, y, asp = a)
  rect(xmin, ymin, xmax, ymax)
}

mapxy(1)

# How do we create a uniform density point pattern in a circle?
# We might first try selecting radius and angle uniformly.  But that produces a "cluster" at small distances
# Instead we sample the radius from a non-uniform distribution that scales linearly with distance, so we have fewer points at small radii and more at large radii
# This exercise uses spatstat's disc() function, that creates a circular window.

# Load the spatstat package
library(spatstat)
## Loading required package: spatstat.data
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## Loading required package: rpart
## 
## spatstat 1.55-0       (nickname: 'Stunned Mullet') 
## For an introduction to spatstat, type 'beginner'
## 
## Note: R version 3.3.3 (2017-03-06) is more than 9 months old; we strongly recommend upgrading to the latest version
# Create this many points, in a circle of this radius
n_points <- 300
radius <- 10

# Generate uniform random numbers up to radius-squared
r_squared <- runif(n_points, 0, radius**2)
angle <- runif(n_points, 0, 2*pi)

# Take the square root of the values to get a uniform spatial distribution
x <- sqrt(r_squared) * cos(angle)
y <- sqrt(r_squared) * sin(angle)

plot(spatstat::disc(radius))
points(x, y)

# Some variables have been pre-defined
# ls.str()

# Set coordinates and window
ppxy <- ppp(x = x, y = y, window = disc(radius))

# Test the point pattern
qt <- quadrat.test(ppxy)
## Warning: Some expected counts are small; chi^2 approximation may be
## inaccurate
# Inspect the results
plot(qt)

print(qt)
## 
##  Chi-squared test of CSR using quadrat counts
##  Pearson X2 statistic
## 
## data:  ppxy
## X2 = 33.579, df = 24, p-value = 0.1849
## alternative hypothesis: two.sided
## 
## Quadrats: 25 tiles (irregular windows)
# In the previous exercise you used a set of 300 events scattered uniformly within a circle
# If you repeated the generation of the events again you will still have 300 of them, but in different locations
# The dataset of exactly 300 points is from a Poisson point process conditioned on the total being 300
# The spatstat package can generate Poisson spatial processes with the rpoispp() function given an intensity and a window, that are not conditioned on the total
# Just as the random number generator functions in R start with an "r", most of the random point-pattern functions in spatstat start with an "r".
# The area() function of spatstat will compute the area of a window such as a disc

# Create a disc of radius 10
disc10 <- disc(10)

# Compute the rate as count divided by area
lambda <- 500 / area(disc10)

# Create a point pattern object
ppois <- rpoispp(lambda = lambda, win = disc10)

# Plot the Poisson point pattern
plot(ppois)

# The spatstat package also has functions for generating point patterns from other process modelsparameters.
# These generally fall into one of two classes: clustered processes, where points occur together more than under a uniform Poisson process, 
# and regular (aka inhibitory) processes where points are more spaced apart than under a uniform intensity Poisson process
# Some process models can generate patterns on a continuum from clustered through uniform to regular depending on their parameters

# The quadrat.test() function can test against clustered or regular alternative hypotheses
# By default it tests against either of those, but this can be changed with the alternative parameter to create a one-sided test.

# A Thomas process is a clustered pattern where a number of "parent" points, uniformly distributed, create a number of "child" points in their neighborhood
# The child points themselves form the pattern. This is an attractive point pattern, and makes sense for modelling things like trees, since new trees will grow near the original tree
# Random Thomas point patterns can be generated using rThomas()
# This takes three numbers that determine the intensity and clustering of the points, and a window object.

# Conversely the points of a Strauss process cause a lowering in the probability of finding another point nearby
# The parameters of a Strauss process can be such that it is a "hard-core" process, where no two points can be closer than a set threshold
# Creating points from this process involves some clever simulation algorithms
# This is a repulsive point pattern, and makes sense for modelling things like territorial animals, since the other animals of that species will avoid the territory of a given animal
# Random Strauss point patterns can be generated using rStrauss()
# This takes three numbers that determine the intensity and "territory" of the points, and a window object
# Points generated by a Strauss process are sometimes called regularly spaced.

# Create a disc of radius 10
disc10 <- disc(10)

# Generate clustered points from a Thomas process
set.seed(123)
p_cluster <- rThomas(kappa = 0.35, scale = 1, mu = 3, win = disc10)
plot(p_cluster)

# Run a quadrat test
quadrat.test(p_cluster, alternative = "clustered")
## Warning: Some expected counts are small; chi^2 approximation may be
## inaccurate
## 
##  Chi-squared test of CSR using quadrat counts
##  Pearson X2 statistic
## 
## data:  p_cluster
## X2 = 53.387, df = 24, p-value = 0.0005142
## alternative hypothesis: clustered
## 
## Quadrats: 25 tiles (irregular windows)
# Regular points from a Strauss process
set.seed(123)
p_regular <- rStrauss(beta = 2.9, gamma = 0.025, R = .5, W = disc10)
## Warning: Simulation will be performed in the containing rectangle and
## clipped to the original window.
plot(p_regular)

# Run a quadrat test
quadrat.test(p_regular, alternative = "regular")
## Warning: Some expected counts are small; chi^2 approximation may be
## inaccurate
## 
##  Chi-squared test of CSR using quadrat counts
##  Pearson X2 statistic
## 
## data:  p_regular
## X2 = 8.57, df = 24, p-value = 0.001619
## alternative hypothesis: regular
## 
## Quadrats: 25 tiles (irregular windows)
# Another way of assessing clustering and regularity is to consider each point, and how it relates to the other points
# One simple measure is the distribution of the distances from each point to its nearest neighbor
# The nndist() function in spatstat takes a point pattern and for each point returns the distance to its nearest neighbor

# Instead of working with the nearest-neighbor density, as seen in the histogram, it can be easier to work with the cumulative distribution function, G(r) 
# This is the probability of a point having a nearest neighbour within a distance r
# For a uniform Poisson process, G can be computed theoretically, and is G(r) = 1 - exp( - lambda * pi * r ^ 2)
# You can compute G empirically from your data using Gest() and so compare with the theoretical value.

# Events near the edge of the window might have had a nearest neighbor outside the window, and so unobserved
# This will make the distance to its observed nearest neighbor larger than expected, biasing the estimate of G
# There are several methods for correcting this bias

# Plotting the output from Gest shows the theoretical cumulative distribution and several estimates of the cumulative distribution using different edge corrections
# Often these edge corrections are almost indistinguishable, and the lines overlap
# The plot can be used as a quick exploratory test of complete spatial randomness

# Two ppp objects, p_poisson, and p_regular are defined for you
# Point patterns are pre-defined
p_poisson <- ppois
p_poisson
## Planar point pattern: 555 points
## window: polygonal boundary
## enclosing rectangle: [-10, 10] x [-10, 10] units
p_regular
## Planar point pattern: 325 points
## window: polygonal boundary
## enclosing rectangle: [-10, 10] x [-10, 10] units
# Calc nearest-neighbor distances for Poisson point data
nnd_poisson <- nndist(p_poisson)

# Draw a histogram of nearest-neighbor distances
hist(nnd_poisson)

# Estimate G(r)
G_poisson <- Gest(p_poisson)

# Plot G(r) vs. r
plot(G_poisson)

# Repeat for regular point data
nnd_regular <- nndist(p_regular)
hist(nnd_regular)

G_regular <- Gest(p_regular)
plot(G_regular)

# A number of other functions of point patterns have been developed
# They are conventionally denoted by various capital letters, including F, H, J, K and L

# The K-function is defined as the expected number of points within a distance of a point of the process, scaled by the intensity
# Like G, this can be computed theoretically for a uniform Poisson process and is K(r) = pi * r ^ 2 - the area of a circle of that radius
# Deviation from pi * r ^ 2 can indicate clustering or point inhibition
# Computational estimates of K(r) are done using the Kest() function.

# As with G calculations, K-function calculations also need edge corrections
# The default edge correction in spatstat is generally the best, but can be slow, so we'll use the "border" correction for speed here

# Uncertainties on K-function estimates can be assessed by randomly sampling points from a uniform Poisson process in the area and computing the K-function of the simulated data
# Repeat this process 99 times, and take the minimum and maximum value of K over each of the distance values
# This gives an envelope - if the K-function from the data goes above the top of the envelope then we have evidence for clustering
# If the K-function goes below the envelope then there is evidence for an inhibitory process causing points to be spaced out
# Envelopes can be computed using the envelope() function

# The plot method for estimates of K uses a formula system where a dot on the left of a formula refers to K®
# So the default plot uses . ~ r
# You can compare the estimate of K to a Poisson process by plotting . - pi * r ^ 2 ~ r
# If the data was generated by a Poisson process, then the line should be close to zero for all values of r

# Point patterns are pre-defined
p_poisson
## Planar point pattern: 555 points
## window: polygonal boundary
## enclosing rectangle: [-10, 10] x [-10, 10] units
p_cluster
## Planar point pattern: 332 points
## window: polygonal boundary
## enclosing rectangle: [-10, 10] x [-10, 10] units
p_regular
## Planar point pattern: 325 points
## window: polygonal boundary
## enclosing rectangle: [-10, 10] x [-10, 10] units
# Estimate the K-function for the Poisson points
K_poisson <- Kest(p_poisson, correction = "border")

# The default plot shows quadratic growth
plot(K_poisson, . ~ r)

# Subtract pi * r ^ 2 from the Y-axis and plot
plot(K_poisson, . - pi * r**2 ~ r)

# Compute envelopes of K under random locations
K_cluster_env <- envelope(p_cluster, Kest, correction = "border")
## Generating 99 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
## 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
## 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,  99.
## 
## Done.
# Insert the full formula to plot K minus pi * r^2
plot(K_cluster_env, . - pi * r^2 ~ r)

# Repeat for regular data
K_regular_env <- envelope(p_regular, Kest, correction = "border")
## Generating 99 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
## 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
## 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,  99.
## 
## Done.
plot(K_regular_env, . - pi * r^2 ~ r)


Chapter 2 - Point Pattern Analysis

Bivariate point problems:

  • People tend to clump together in cities in neighborhoods, so “things people do/cause” (accidents, crime, etc.) are likely to be clumped even in the absence of differential rates
  • The bivariate point pattern is the solution - look at both where things do happen and also where they do not happen (e.g., with disease, looking at prevalence rate rather than total patient counts by area)
    • Can run either a bivariate nearest neighbors or a bivariate K-function
    • The null hypothesis is usually one of uniform rates, which is to say that there is no clustering of incidents and controls
  • Kernel smoothing can help with plotting the rates - replace each event with a kernel, and assign every point the sum of all kernels
    • Need to pick the shape and bandwidth of the kernel
    • The spatstat package has some reasonable defaults, and these can be over-ridden if desired

Spatial segregation:

  • The null hypothesis is typically that rates are constant wherever you go within the window (though volumes might clump)
    • The spseg() function estimates a bandwidth for the kernel function
    • This is then used for creating the rates by point, which can be plotted
  • Monte Carlo simulation can be used to permute the events, keeping event locations and relative ratio of events within the window constant
    • Can then find from simulation the 95% CI for event rate, and identify outliers on the graph
    • Can use plotmc() somehow to get the confidence intervals overlayed on the original plot, although it does not always work directly with ppp

Space-time data:

  • Can have a purely temporal process, which also includes a “window” (start and end time)
    • Each coordinate will now have a t coordinate, so frequently (x, y, t)
    • The t needs to be numeric, so make a switch if it was included as a Character or Date or POSIXct or the like
    • Can plot in several ways, though typically the time is shown as a histogram while the (x, y) are plotted as in previous chapters
  • The rggobi package can provide linked plot, where you brush point by time and see various highlighting associated to the selection

Space-time clustering:

  • Often, there is interest in testing a hypothesis of space-time independence
  • The Space-Time K Function typically has time on the y-axis and “spatial distance” on the x-axis
    • Basically, there is a cylinder of space time, for points within distance d and also within time t of each other
    • Assuming independence, then Ks(s) * Kt(t) = Kst(s, t)
    • Can define a test statistic D(s, t) = Kst(s, t) - Ks(s) * Kt(t)
  • The test consists of Monte Carlo with permuted event times - same distances, same times, but not interaction effect
    • Compare the observed test statistic D with the Monte Carlo null simulations
    • Can run stmctest(myPoints, myTimes, shapeWindow, timeWindow, dIntervals, tIntervals, nSim)

Example code includes:

# The dataset we shall use for this example consists of crimes in a 4km radius of the center of Preston, a town in north-west England
# We want to look for hotspots of violent crime in the area

# A ppp object called preston_crime has been constructed
# This is a marked point process, where each point is marked as either a "Violent Crime" or a "Non-violent crime"
# The marks for each point can be retrieved using the marks() function
# The window is a 4km circle centered on the town center

# A map image of the town from OpenStreetMap has also been loaded, called preston_osm

preston_crime <- readRDS("./RInputFiles/pcrime-spatstat.RDS")
preston_osm <- readRDS("./RInputFiles/osm_preston_gray.RDS")

# Get some summary information on the dataset
summary(preston_crime)
## Marked planar point pattern:  2036 points
## Average intensity 4.053214e-05 points per square unit
## 
## Coordinates are given to 2 decimal places
## i.e. rounded to the nearest multiple of 0.01 units
## 
## Multitype:
##                   frequency proportion    intensity
## Non-violent crime      1812  0.8899804 3.607281e-05
## Violent crime           224  0.1100196 4.459332e-06
## 
## Window: polygonal boundary
## single connected closed polygon with 99 vertices
## enclosing rectangle: [349773, 357771] x [425706.5, 433705.5] units
## Window area = 50231700 square units
## Fraction of frame area: 0.785
# Get a table of marks
table(marks(preston_crime))
## 
## Non-violent crime     Violent crime 
##              1812               224
# Define a function to create a map
preston_map <- function(cols = c("green","red"), cex = c(1, 1), pch = c(1, 1)) {
  raster::plotRGB(preston_osm) # from the raster package
  plot(preston_crime, cols = cols, pch = pch, cex = cex, add = TRUE, show.window = TRUE)
}

# Draw the map with colors, sizes and plot character
preston_map(
  cols = c("black", "red"), 
  cex = c(0.5, 1), 
  pch = 19
)

# One method of computing a smooth intensity surface from a set of points is to use kernel smoothing
# Imagine replacing each point with a dot of ink on absorbent paper
# Each individual ink drop spreads out into a patch with a dark center, and multiple drops add together and make the paper even darker
# With the right amount of ink in each drop, and with paper of the right absorbency, you can create a fair impression of the density of the original points
# In kernel smoothing jargon, this means computing a bandwidth and using a particular kernel function

# To get a smooth map of violent crimes proportion, we can estimate the intensity surface for violent and non-violent crimes, and take the ratio
# To do this with the density() function in spatstat, we have to split the points according to the two values of the marks and then compute the ratio of the violent crime surface to the total
# The function has sensible defaults for the kernel function and bandwidth to guarantee something that looks at least plausible

# preston_crime has been pre-defined
preston_crime
## Marked planar point pattern: 2036 points
## Multitype, with levels = Non-violent crime, Violent crime 
## window: polygonal boundary
## enclosing rectangle: [349773, 357771] x [425706.5, 433705.5] units
# Use the split function to show the two point patterns
crime_splits <- split(preston_crime)

# Plot the split crime
plot(crime_splits)

# Compute the densities of both sets of points
crime_densities <- density(crime_splits)

# Calc the violent density divided by the sum of both
frac_violent_crime_density <- crime_densities[[2]] / 
  (crime_densities[[1]] + crime_densities[[2]])

# Plot the density of the fraction of violent crime
plot(frac_violent_crime_density)

# We can get a more principled measure of the violent crime ratio using a spatial segregation model
# The spatialkernel package implements the theory of spatial segregation

# The first step is to compute the optimal bandwidth for kernel smoothing under the segregation model
# A small bandwidth would result in a density that is mostly zero, with spikes at the event locations
# A large bandwidth would flatten out any structure in the events, resulting in a large "blob" across the whole window
# Somewhere between these extremes is a bandwidth that best represents an underlying density for the process

# spseg() will scan over a range of bandwidths and compute a test statistic using a cross-validation method
# The bandwidth that maximizes this test statistic is the one to use
# The returned value from spseg() in this case is a list, with h and cv elements giving the values of the statistic over the input h values
# The spatialkernel package supplies a plotcv function to show how the test value varies
# The hcv element has the value of the best bandwidth

# spatstat is loaded and the preston_crime object is read in

# Scan from 500m to 1000m in steps of 50m
bw_choice <- spatialkernel::spseg(
    preston_crime,
    h = seq(500, 1000, by = 50),
    opt = 1)
## 
## Calculating cross-validated likelihood function
# Plot the results and highlight the best bandwidth
spatialkernel::plotcv(bw_choice)
abline(v = bw_choice$hcv, lty = 2, col = "red")

# Print the best bandwidth
print(bw_choice$hcv)
## [1] 800
# The second step is to compute the probabilities for violent and non-violent crimes as a smooth surface, as well as the p-values for a point-wise test of segregation
# This is done by calling spseg() with opt = 3 and a fixed bandwidth parameter h

# Normally you would run this process for at least 100 simulations, but that will take too long to run here
# Instead, run for only 10 simulations
# Then you can use a pre-loaded object seg which is the output from a 1000 simulation run that took about 20 minutes to complete

# Set the correct bandwidth and run for 10 simulations only
seg10 <- spatialkernel::spseg(
    pts = preston_crime,
    h = bw_choice$hcv,
    opt = 3,
    ntest = 10,
    proc = FALSE)

# Plot the segregation map for violent crime
spatialkernel::plotmc(seg10, "Violent crime")

# Plot seg, the result of running 1000 simulations (not included here)
# spatialkernel::plotmc(seg, "Violent crime")


# With a base map and some image and contour functions we can display both the probabilities and the significance tests over the area with more control than the plotmc() function.

# The seg object is a list with several components
# The X and Y coordinates of the grid are stored in the $gridx and $gridy elements
# The probabilities of each class of data (violent or non-violent crime) are in a matrix element $p with a column for each class
# The p-value of the significance test is in a similar matrix element called $stpvalue
# Rearranging columns of these matrices into a grid of values can be done with R's matrix() function
# From there you can construct list objects with a vector $x of X-coordinates, $y of Y-coordinates, and $z as the matrix
# You can then feed this to image() or contour() for visualization

# This process may seem complex, but remember that with R you can always write functions to perform complex tasks and those you may repeat often
# For example, to help with the mapping in this exercise you will create a function that builds a map from four different items

# The seg object from 1000 simulations is loaded, as well as the preston_crime points and the preston_osm map image
# Inspect the structure of the spatial segregation object
# str(seg)

# Get the number of columns in the data so we can rearrange to a grid
# ncol <- length(seg$gridx)

# Rearrange the probability column into a grid
# prob_violent <- list(x = seg$gridx,
#                      y = seg$gridy,
#                      z = matrix(seg$p[, "Violent crime"],
#                                 ncol = ncol))
# image(prob_violent)

# Rearrange the p-values, but choose a p-value threshold
# p_value <- list(x = seg$gridx,
#                 y = seg$gridy,
#                 z = matrix(seg$stpvalue[, "Violent crime"] < 0.05,
#                            ncol = ncol))
# image(p_value)

# Create a mapping function
# segmap <- function(prob_list, pv_list, low, high){
# 
#   # background map
#   plotRGB(preston_osm)
# 
#   # p-value areas
#   image(pv_list, 
#         col = c("#00000000", "#FF808080"), add = TRUE) 
# 
#   # probability contours
#   contour(prob_list,
#           levels = c(low, high),
#           col = c("#206020", "red"),
#           labels = c("Low", "High"),
#           add = TRUE)
# 
#   # boundary window
#   plot(Window(preston_crime), add = TRUE)
# }
# 
# # Map the probability and p-value
# segmap(prob_violent, p_value, 0.05, 0.15)


# The sasquatch, or "bigfoot", is a large ape-like creature reported to live in North American forests
# The Bigfoot Field Researchers Organization maintains a database of sightings and allows its use for teaching and research
# A cleaned subset of data in north-west USA has been created as the ppp object sasq and is loaded for you to explore the space-time pattern of sightings in the area

# Get a quick summary of the dataset
sasq <- readRDS("./RInputFiles/sasquatch.RDS")
summary(sasq)
## Marked planar point pattern:  423 points
## Average intensity 2.097156e-09 points per square unit
## 
## *Pattern contains duplicated points*
## 
## Coordinates are given to 1 decimal place
## i.e. rounded to the nearest multiple of 0.1 units
## 
## Mark variables: date, year, month
## Summary:
##       date                 year         month    
##  Min.   :1990-05-03   Y2004  : 41   Sep    : 59  
##  1st Qu.:2000-04-30   Y2000  : 36   Oct    : 56  
##  Median :2003-11-05   Y2002  : 30   Aug    : 54  
##  Mean   :2003-08-11   Y2005  : 30   Jul    : 50  
##  3rd Qu.:2007-11-02   Y2001  : 26   Nov    : 43  
##  Max.   :2016-04-05   Y2008  : 26   Jun    : 41  
##                       (Other):234   (Other):120  
## 
## Window: polygonal boundary
## single connected closed polygon with 64 vertices
## enclosing rectangle: [368187.8, 764535.6] x [4644873, 5434933] units
## Window area = 2.01702e+11 square units
## Fraction of frame area: 0.644
# Plot unmarked points
plot(unmark(sasq))

# Plot the points using a circle sized by date
plot(sasq, which.marks = "date")

# Show the available marks
names(marks(sasq))
## [1] "date"  "year"  "month"
# Histogram the dates of the sightings, grouped by year
hist(marks(sasq)$date, "years", freq = TRUE)

# Plot and tabulate the calendar month of all the sightings
plot(table(marks(sasq)$month))

# Split on the month mark
sasq_by_month <- split(sasq, "month", un = TRUE)

# Plot monthly maps
plot(sasq_by_month)

# Plot smoothed versions of the above split maps
plot(density(sasq_by_month))

# To do a space-time clustering test with stmctest() from the splancs package, you first need to convert parts of your ppp object
# Functions in splancs tend to use matrix data instead of data frames.
# To run stmctest() you need to set up
# event locations
# event times
# region polygon
# time limits
# the time and space ranges for analysis

# The sasq object is loaded and the spatstat and splancs packages are ready for use
# Get a matrix of event coordinates
sasq_xy <- as.matrix(spatstat::coords(sasq))

# Check the matrix has two columns
dim(sasq_xy)
## [1] 423   2
# Get a vector of event times
sasq_t <- marks(sasq)$date

# Extract a two-column matrix from the ppp object
sasq_poly <- as.matrix(as.data.frame(Window(sasq)))
dim(sasq_poly)
## [1] 64  2
# Set the time limit to 1 day before and 1 day after the range of times
tlimits <- range(sasq_t) + c(-1, 1)

# Scan over 400m intervals from 100m to 20km
s <- seq(100, 20000, by = 400)

# Scan over 14 day intervals from one week to 31 weeks
tm <- seq(7, 217, by = 14)


# Everything is now ready for you to run the space-time clustering test function
# You can then plot the results and compute a p-value for rejecting the null hypothesis of no space-time clustering

# Any space-time clustering in a data set will be removed if you randomly rearrange the dates of the data points
# The stmctest() function computes a clustering test statistic for your data based on the space-time K-function - how many points are within a spatial and temporal window of a point of the data
# It then does a number of random rearrangements of the dates among the points and computes the clustering statistic
# After doing this a large number of times, you can compare the test statistic for your data with the values from the random data
# If the test statistic for your data is sufficiently large or small, you can reject the null hypothesis of no space-time clustering

# The output from stmctest() is a list with a single t0 which is the test statistic for your data, and a vector of t from the simulations
# By converting to data frame you can feed this to ggplot functions

# Because the window area is a large number of square meters, and we have about 400 events, the numerical value of the intensity is a very small number
# This makes values of the various K-functions very large numbers, since they are proportional to the inverse of the intensity
# Don't worry if you see 10^10 or higher

# The p-value of a Monte-Carlo test like this is just the proportion of test statistics that are larger than the value from the data
# You can compute this from the t and t0 elements of the output

# All the objects from the previous exercise are loaded.

# Run 999 simulations 
sasq_mc <- splancs::stmctest(sasq_xy, sasq_t, sasq_poly, tlimits, s, tm, nsim = 999, quiet = TRUE)
names(sasq_mc)
## [1] "t0" "t"
# Histogram the simulated statistics and add a line at the data value
ggplot(data.frame(sasq_mc), aes(x = t)) +
  geom_histogram(binwidth = 1e13) +
  geom_vline(aes(xintercept = t0))

# Compute the p-value as the proportion of tests greater than the data
sum(sasq_mc$t > sasq_mc$t0) / 1000
## [1] 0.04

Chapter 3 - Areal Statistics

Areal statistics:

  • Areal statistics is the idea of point data being aggregated in to a region (for example, for confidentiality reasons)
    • Boundaries are typically note created in a manner that supports good statistical analysis
  • The cartogram is one solution to areal statistics - the cartogram makes the area proportional to a unit of interest (e.g., population)
  • The simplest hypothesis is that data are spatially random, which is to say that each region is indepenent of all the other regions
    • The alternative hypothesis is that “neighbors” tend to be more similar than random regions
    • Can run packages to get the neighbors for each region, frequently defined similar to neighbors for a graph (network)
    • Can also convert the list resulting from the neighbors package to an adjacency matrix of 1/0 for who is neighbors
  • The Moran I statistic is a test statistic that can be used to assess similarities or differences among neighbors
    • Large positive values of I mean that neighbors tend to be more similar
    • Large negative values of I mean that neighbors tend to be more dissimilar
    • Near-zero values of I mean that neighbors tend to be no more or less similar than any random regions
    • Monte Carlo tests by permuting regional values can build a null distribution

Spatial health data:

  • Population health data is frequently available open-source, and often comes in standard forms
    • There is sometimes a challenge in getting the proper rates since the denominator should be “at risk” and not “total population”
    • Incidence Rate is the Number of Cases divided by Population at Risk (typically between 0 and 1)
    • Common to standardize incidence rate as Incidence-Rate-Local / Incidence-Rate-Global, known as SMR (standardized morbidity ratio)
    • The expected number of cases in a region would be Incidence-Rate-Global * Regional-AtRisk-Population (can also be converted to SMR)
  • Can run hypothesis tests and confidence intervals on the SMR
    • Can also shade maps based on SMR values (e.g., red for SMR > 2 with 99%+ probability)

Generalized linear models in space:

  • Count data are naturally modeled by the Poisson distribution where Y ~ Poisson(e ** (X %*% Beta))
  • When applying GLM to spatial data, can have a map of residuals
    • The residuals should be “uncorrelated” and look random
    • If the residuals are “correlated”, then inferences about the parameter estimates are very possibly mistaken
    • There are techniques for dealing with spatial correlations of the residuals

Correlation in spatial GLM:

  • Check whether the residuals can be explained by another known factor - add them to the model as needed
  • An additional approach is to explcitly add a spatial term S(x, y) to the model
    • The conditional autocorrelation model says that any given region should have mean equal to the average of its neighbors, with variance to be fitted
    • The “car” model stands for “conditionally autocorrelated regression”
  • Bayesian statistics are frequently used, basically, what are the likelihoods for the parameter given the observed data?
    • The credible interval is the Bayesian equivalent for confidence interval

Example code includes:

library(cartogram)
library(rgeos)
## rgeos version: 0.3-26, (SVN revision 560)
##  GEOS runtime version: 3.6.1-CAPI-1.10.1 r0 
##  Linking to sp version: 1.2-7 
##  Polygon checking: TRUE
library(spdep)
## Loading required package: sp
## Loading required package: Matrix
## Loading required package: spData
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge')`
## 
## Attaching package: 'spData'
## The following objects are masked _by_ '.GlobalEnv':
## 
##     x, y
library(epitools)
library(R2BayesX)
## Loading required package: BayesXsrc
## Loading required package: colorspace
## 
## Attaching package: 'colorspace'
## The following object is masked from 'package:spatstat':
## 
##     coords
## Loading required package: mgcv
## This is mgcv 1.8-17. For overview type 'help("mgcv-package")'.
# In 2016 the UK held a public vote on whether to remain in the European Union
# The results of the referendum, where people voted either "Remain" or "Leave", are available online
# The data set london_ref contains the results for the 32 boroughs of London, and includes the number and percentage of votes in each category as well as the count of spoilt votes, the population size and the electorate size

# The london_ref object is a SpatialPolygonsDataFrame, a special kind of data frame where each row also has the shape of the borough
# It behaves like a data frame in many respects, but can also be used to plot a choropleth, or shaded polygon, map

# You should start with some simple data exploration and mapping. The following variables will be useful:
# NAME : the name of the borough.
# Electorate : the total number of people who can vote.
# Remain, Leave : the number of votes for "Remain" or "Leave".
# Pct_Remain, Pct_Leave : the percentage of votes for each sid

# spplot() from the raster package provides a convenient way to draw a shaded map of regions

# See what information we have for each borough
london_ref <- readRDS("./RInputFiles/london_eu.RDS")
summary(london_ref)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##        min      max
## x 503574.2 561956.7
## y 155850.8 200933.6
## Is projected: TRUE 
## proj4string :
## [+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000
## +y_0=-100000 +datum=OSGB36 +units=m +no_defs +ellps=airy
## +towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894]
## Data attributes:
##      NAME             TOTAL_POP        Electorate       Votes_Cast    
##  Length:32          Min.   :157711   Min.   : 83042   Min.   : 54801  
##  Class :character   1st Qu.:237717   1st Qu.:143458   1st Qu.:104079  
##  Mode  :character   Median :272017   Median :168394   Median :116280  
##                     Mean   :270780   Mean   :169337   Mean   :118025  
##                     3rd Qu.:316911   3rd Qu.:196285   3rd Qu.:134142  
##                     Max.   :379691   Max.   :245349   Max.   :182570  
##      Remain           Leave       Rejected_Ballots   Pct_Remain   
##  Min.   : 27750   Min.   :17138   Min.   : 60.0    Min.   :30.34  
##  1st Qu.: 55973   1st Qu.:32138   1st Qu.:105.0    1st Qu.:53.69  
##  Median : 70254   Median :45263   Median :138.0    Median :61.01  
##  Mean   : 70631   Mean   :47255   Mean   :139.0    Mean   :60.46  
##  3rd Qu.: 84287   3rd Qu.:59018   3rd Qu.:164.2    3rd Qu.:69.90  
##  Max.   :118463   Max.   :96885   Max.   :267.0    Max.   :78.62  
##    Pct_Leave      Pct_Rejected      Assembly        
##  Min.   :21.38   Min.   :0.0600   Length:32         
##  1st Qu.:30.10   1st Qu.:0.0875   Class :character  
##  Median :38.99   Median :0.1100   Mode  :character  
##  Mean   :39.54   Mean   :0.1187                     
##  3rd Qu.:46.31   3rd Qu.:0.1500                     
##  Max.   :69.66   Max.   :0.2200
# Which boroughs voted to "Leave"?
london_ref$NAME[london_ref$Leave > london_ref$Remain]
## [1] "Sutton"               "Barking and Dagenham" "Bexley"              
## [4] "Havering"             "Hillingdon"
# Plot a map of the percentage that voted "Remain"
sp::spplot(london_ref, zcol = "Pct_Remain")

# Large areas, such as cities or countries, are often divided into smaller administrative units, often into zones of approximately equal population
# But the area of those units may vary considerably
# When mapping them, the large areas carry more visual "weight" than small areas, although just as many people live in the small areas.

# One technique for correcting for this is the cartogram
# This is a controlled distortion of the regions, expanding some and contracting others, so that the area of each region is proportional to a desired quantity, such as the population
# The cartogram also tries to maintain the correct geography as much as possible, by keeping regions in roughly the same place relative to each other

# The cartogram package contains functions for creating cartograms
# You give it a spatial data frame and the name of a column, and you get back a similar data frame but with regions distorted so that the region area is proportional to the column value of the regions

# You'll also use the rgeos package for computing the areas of individual regions with the gArea() function

# Use the cartogram and rgeos packages (called at top of routine)
# library(cartogram)
# library(rgeos)

# Make a scatterplot of electorate vs borough area
names(london_ref)
##  [1] "NAME"             "TOTAL_POP"        "Electorate"      
##  [4] "Votes_Cast"       "Remain"           "Leave"           
##  [7] "Rejected_Ballots" "Pct_Remain"       "Pct_Leave"       
## [10] "Pct_Rejected"     "Assembly"
plot(london_ref$Electorate, gArea(london_ref, byid = TRUE))

# Make a cartogram, scaling the area to the electorate
carto_ref <- cartogram(london_ref, "Electorate")
## Mean size error for iteration 1: 1.5881743190908
## Mean size error for iteration 2: 1.32100446455657
## Mean size error for iteration 3: 1.18227723476121
## Mean size error for iteration 4: 1.10676057030171
## Mean size error for iteration 5: 1.0657703107641
## Mean size error for iteration 6: 1.04259259672006
## Mean size error for iteration 7: 1.02832326230708
## Mean size error for iteration 8: 1.01931941526112
## Mean size error for iteration 9: 1.01341424685212
## Mean size error for iteration 10: 1.00941370606418
## Mean size error for iteration 11: 1.00663364742297
## Mean size error for iteration 12: 1.00470553629914
## Mean size error for iteration 13: 1.00336434720465
## Mean size error for iteration 14: 1.00241457265516
## Mean size error for iteration 15: 1.00174179254187
plot(carto_ref)

# Check the linearity of the electorate-area plot
plot(carto_ref$Electorate, gArea(carto_ref, byid = TRUE))

# Make a fairer map of the Remain percentage
sp::spplot(carto_ref, "Pct_Remain")

# The map of "Remain" votes seems to have spatial correlation
# Pick any two boroughs that are neighbors - with a shared border - and the chances are they'll be more similar than any two random boroughs
# This can be a problem when using statistical models that assume, conditional on the model, that the data points are independent

# The spdep package has functions for measures of spatial correlation, also known as spatial dependency
# Computing these measures first requires you to work out which regions are neighbors via the poly2nb() function, short for "polygons to neighbors"
# The result is an object of class nb
# Then you can compute the test statistic and run a significance test on the null hypothesis of no spatial correlation
# The significance test can either be done by Monte-Carlo or theoretical models

# In this example you'll use the Moran "I" statistic to test the spatial correlation of the population and the percentage "Remain" vote.

# The london_ref spatial data object is loaded for you

# Use the spdep package (called at top of routine)
# library(spdep)

# Make neighbor list
borough_nb <- poly2nb(london_ref)

# Get center points of each borough
borough_centers <- coordinates(london_ref)

# Show the connections
plot(london_ref)
plot(borough_nb, borough_centers, add = TRUE)

# Map the total pop'n
sp::spplot(london_ref, zcol = "TOTAL_POP")

# Run a Moran I test on total pop'n
moran.test(
  london_ref$TOTAL_POP, 
  nb2listw(borough_nb)
)
## 
##  Moran I test under randomisation
## 
## data:  london_ref$TOTAL_POP  
## weights: nb2listw(borough_nb)  
## 
## Moran I statistic standard deviate = 1.2124, p-value = 0.1127
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.11549264       -0.03225806        0.01485190
# Map % Remain
sp::spplot(london_ref, zcol = "Pct_Remain")

# Run a Moran I MC test on % Remain
moran.mc(
  london_ref$Pct_Remain, 
  nb2listw(borough_nb), 
  nsim = 999
)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  london_ref$Pct_Remain 
## weights: nb2listw(borough_nb)  
## number of simulations + 1: 1000 
## 
## statistic = 0.42841, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater
# The UK's National Health Service publishes weekly data for consultations at a number of "sentinel" clinics and makes this data available
# A dataset for one week in February 2017 has been loaded for you to analyze
# It is called london, and contains data for the 32 boroughs.

# You will focus on reports of "Influenza-like illness", or more simply "Flu"
# Your first task is to map the "Standardized Morbidity Ratio", or SMR
# This is the number of cases per person, but scaled by the overall incidence so that the expected number is 1

# The london object, a spatial data frame, and the sp package are ready for you

# Get a summary of the data set
london <- readRDS("./RInputFiles/london_2017_2.RDS")
summary(london)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##        min      max
## x 503574.2 561956.7
## y 155850.8 200933.6
## Is projected: TRUE 
## proj4string :
## [+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000
## +y_0=-100000 +datum=OSGB36 +units=m +no_defs +ellps=airy
## +towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894]
## Data attributes:
##      CODE               NAME              Flu_OBS          Vom_OBS     
##  Length:32          Length:32          Min.   :  0.00   Min.   : 0.00  
##  Class :character   Class :character   1st Qu.: 11.00   1st Qu.:14.00  
##  Mode  :character   Mode  :character   Median : 33.00   Median :40.00  
##                                        Mean   : 38.56   Mean   :37.28  
##                                        3rd Qu.: 61.00   3rd Qu.:57.50  
##                                        Max.   :112.00   Max.   :96.00  
##    Diarr_OBS        Gastro_OBS      TOTAL_POP        CCGcode         
##  Min.   :  0.00   Min.   :  0.0   Min.   :157711   Length:32         
##  1st Qu.: 22.50   1st Qu.: 48.0   1st Qu.:237717   Class :character  
##  Median : 62.00   Median :120.5   Median :272017   Mode  :character  
##  Mean   : 57.03   Mean   :113.7   Mean   :270780                     
##  3rd Qu.: 90.75   3rd Qu.:176.8   3rd Qu.:316911                     
##  Max.   :122.00   Max.   :251.0   Max.   :379691                     
##  CCG.geography.code   CCG.name         Asthma_Prevalence
##  Length:32          Length:32          Min.   :3.550    
##  Class :character   Class :character   1st Qu.:4.412    
##  Mode  :character   Mode  :character   Median :4.660    
##                                        Mean   :4.624    
##                                        3rd Qu.:4.925    
##                                        Max.   :5.470    
##  Obesity_Prevalence Cancer_Prevalence Diabetes_Prevalence     Income      
##  Min.   : 3.930     Min.   :0.870     Min.   :3.620       Min.   :0.0730  
##  1st Qu.: 5.855     1st Qu.:1.438     1st Qu.:5.265       1st Qu.:0.1308  
##  Median : 7.565     Median :1.605     Median :6.305       Median :0.1665  
##  Mean   : 7.585     Mean   :1.684     Mean   :6.245       Mean   :0.1655  
##  3rd Qu.: 8.810     3rd Qu.:1.903     3rd Qu.:7.067       3rd Qu.:0.1985  
##  Max.   :12.130     Max.   :2.540     Max.   :9.060       Max.   :0.2530  
##    Employment       Education      HealthDeprivation     Crime        
##  Min.   :0.0570   Min.   : 3.958   Min.   :-1.4100   Min.   :-0.1550  
##  1st Qu.:0.0905   1st Qu.:10.047   1st Qu.:-0.5055   1st Qu.: 0.3745  
##  Median :0.1095   Median :13.925   Median :-0.2050   Median : 0.5515  
##  Mean   :0.1092   Mean   :14.024   Mean   :-0.2044   Mean   : 0.5379  
##  3rd Qu.:0.1283   3rd Qu.:17.480   3rd Qu.: 0.2010   3rd Qu.: 0.7823  
##  Max.   :0.1560   Max.   :27.182   Max.   : 0.5430   Max.   : 1.0190  
##     Services      Environment          i        
##  Min.   :19.63   Min.   :13.37   Min.   : 0.00  
##  1st Qu.:24.43   1st Qu.:24.03   1st Qu.: 7.75  
##  Median :30.41   Median :28.20   Median :15.50  
##  Mean   :29.55   Mean   :31.38   Mean   :15.50  
##  3rd Qu.:34.74   3rd Qu.:40.15   3rd Qu.:23.25  
##  Max.   :41.89   Max.   :55.00   Max.   :31.00
# Map the OBServed number of flu reports
sp::spplot(london, "Flu_OBS")

# Compute and print the overall incidence of flu
r <- sum(london$Flu_OBS) / sum(london$TOTAL_POP)
r
## [1] 0.0001424128
# Calculate the expected number for each borough
london$Flu_EXP <- london$TOTAL_POP * r

# Calculate the ratio of OBServed to EXPected
london$Flu_SMR <- london$Flu_OBS / london$Flu_EXP

# Map the SMR
sp::spplot(london, "Flu_SMR")

# SMRs above 1 represent high rates of disease - but how high does an SMR need to be before it can be considered statistically significant?

# Given a number of cases and a population, its possible to work out confidence intervals at some level of the estimate of the ratio of cases per population using the properties of the binomial distribution
# The epitools package has a function binom.exact() which you can use to compute confidence intervals for the flu data
# These can be scaled to be confidence intervals on the SMR by dividing by the overall rate

# The london data set and the sp package are loaded

# For the binomial statistics function (called at top of routine)
# library(epitools)

# Get CI from binomial distribution
flu_ci <- binom.exact(london$Flu_OBS, london$TOTAL_POP)

# Add borough names
flu_ci$NAME <- london$NAME

# Calculate London rate, then compute SMR
r <- sum(london$Flu_OBS) / sum(london$TOTAL_POP)
flu_ci$SMR <- flu_ci$proportion / r

# Subset the high SMR data
flu_high <- flu_ci[flu_ci$SMR > 1, ]

# Plot estimates with CIs
ggplot(flu_high, aes(x = NAME, y = proportion / r, ymin = lower / r, ymax = upper / r)) +
  geom_pointrange() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Distributions and confidence intervals can be difficult things to present to non-statisticians
# An alternative is to present a probability that a value is over a threshold
# For example, public health teams might be interested in when an SMR has more than doubled, and as a statistician you can give a probability that this has happened
# Then the public health team might decide to go to some alert level when the probability of a doubling of SMR is over 0.95

# Again, the properties of the binomial distribution let you compute this for proportional data
# You can then map these exceedence probabilities for some threshold, and use a sensible color scheme to highlight probabilities close to 1

# The london data set has been loaded, and the expected flu case count, Flu_EXP has been computed

# Probability of a binomial exceeding a multiple
binom.exceed <- function(observed, population, expected, e){
    1 - pbinom(e * expected, population, prob = observed / population)
}

# Compute P(rate > 2)
london$Flu_gt_2 <- binom.exceed(
            observed = london$Flu_OBS,
            population = london$TOTAL_POP,
            expected = london$Flu_EXP,
            e = 2)

# Use a 50-color palette that only starts changing at around 0.9
pal <- c(
  rep("#B0D0B0", 40),
  colorRampPalette(c("#B0D0B0", "orange"))(5), 
  colorRampPalette(c("orange", "red"))(5)
)

# Plot the P(rate > 2) map
sp::spplot(london, "Flu_gt_2", col.regions = pal, at = seq(0, 1, len = 50))

# A Poisson generalized linear model is a way of fitting count data to explanatory variables
# You get out parameter estimates and standard errors for your explanatory variables, and can get fitted values and residuals

# The glm() function fits Poisson GLMs. It works just like the lm() function, but you also specify a family argument
# The formula has the usual meaning - response on the left of the ~, and explanatory variables on the right

# To cope with count data coming from populations of different sizes, you specify an offset argument
# This adds a constant term for each row of the data in the model. The log of the population is used in the offset term

# The london health data set has been loaded with the sp package also ready.
# Run a Poisson generalized linear model of how the count of flu cases, Flu_OBS, depends on the Health Deprivation index value, HealthDeprivation
# The first argument is the formula (response vairable on the left)
# The family argument is a function, poisson

# Fit a poisson GLM.
model_flu <- glm(
  Flu_OBS ~ HealthDeprivation, 
  offset = log(TOTAL_POP), 
  data = london, 
  family = "poisson")

# Is HealthDeprivation significant?
summary(model_flu)
## 
## Call:
## glm(formula = Flu_OBS ~ HealthDeprivation, family = "poisson", 
##     data = london, offset = log(TOTAL_POP))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -9.5361  -4.5285  -0.0499   2.9043   8.2194  
## 
## Coefficients:
##                   Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)       -8.78190    0.02869 -306.043   <2e-16 ***
## HealthDeprivation  0.65689    0.06797    9.665   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 703.75  on 31  degrees of freedom
## Residual deviance: 605.03  on 30  degrees of freedom
## AIC: 762.37
## 
## Number of Fisher Scoring iterations: 5
# Put residuals into the spatial data.
london$Flu_Resid <- residuals(model_flu)

# Map the residuals using spplot
sp::spplot(london, "Flu_Resid")

# A linear model should fit the data and leave uncorrelated residuals
# This applies to non-spatial models, where, for example, fitting a straight line through points on a curve would lead to serially-correlated residuals
# A model on spatial data should aim to have residuals that show no significant spatial correlation

# You can test the model fitted to the flu data using moran.mc() from the spdep package
# Monte Carlo Moran tests were previously discussed in the Spatial autocorrelation test exercise earlier in the chapter

# Compute the neighborhood structure.
borough_nb <- poly2nb(london)

# Test spatial correlation of the residuals.
moran.mc(london$Flu_Resid, listw = nb2listw(borough_nb), nsim = 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  london$Flu_Resid 
## weights: nb2listw(borough_nb)  
## number of simulations + 1: 1000 
## 
## statistic = 0.15059, observed rank = 925, p-value = 0.075
## alternative hypothesis: greater
# Bayesian statistical models return samples of the parameters of interest (the "posterior" distribution) based on some "prior" distribution which is then updated by the data
# The Bayesian modelling process returns a number of samples from which you can compute the mean, or an exceedence probability, or any other quantity you might compute from a distribution

# Before you fit a model with spatial correlation, you'll first fit the same model as before, but using Bayesian inference

# The london data set has been loaded
# The R2BayesX package provides an interface to the BayesX code.

# The syntax for bayesx() is similar, but the offset has to be specified explicitly from the data frame, the family name is in quotes, and the spatial data frame needs to be turned into a plain data frame
# Run the model fitting and inspect with summary()

# Plot the samples from the Bayesian model
# On the left is the "trace" of samples in sequential order, and on the right is the parameter density
# For this model there is an intercept and a slope for the Health Deprivation score
# The parameter density should correspond with the parameter summary

# Use R2BayesX (called at top of routine)
# library(R2BayesX)

# Fit a GLM
model_flu <- glm(Flu_OBS ~ HealthDeprivation, offset = log(TOTAL_POP),
                data = london, family = poisson)
                    
# Summarize it                    
summary(model_flu)
## 
## Call:
## glm(formula = Flu_OBS ~ HealthDeprivation, family = poisson, 
##     data = london, offset = log(TOTAL_POP))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -9.5361  -4.5285  -0.0499   2.9043   8.2194  
## 
## Coefficients:
##                   Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)       -8.78190    0.02869 -306.043   <2e-16 ***
## HealthDeprivation  0.65689    0.06797    9.665   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 703.75  on 31  degrees of freedom
## Residual deviance: 605.03  on 30  degrees of freedom
## AIC: 762.37
## 
## Number of Fisher Scoring iterations: 5
# Calculate coeff confidence intervals
confint(model_flu)
## Waiting for profiling to be done...
##                       2.5 %     97.5 %
## (Intercept)       -8.838677 -8.7261843
## HealthDeprivation  0.524437  0.7908841
# Fit a Bayesian GLM
bayes_flu <- bayesx(Flu_OBS ~ HealthDeprivation, offset = log(london$TOTAL_POP), 
                    family = "poisson", data = as.data.frame(london), 
                    control = bayesx.control(seed = 17610407))
                    
# Summarize it                    
summary(bayes_flu)
## Call:
## bayesx(formula = Flu_OBS ~ HealthDeprivation, data = as.data.frame(london), 
##     offset = log(london$TOTAL_POP), control = bayesx.control(seed = 17610407), 
##     family = "poisson")
##  
## Fixed effects estimation results:
## 
## Parametric coefficients:
##                      Mean      Sd    2.5%     50%   97.5%
## (Intercept)       -8.7831  0.0278 -8.8371 -8.7841 -8.7263
## HealthDeprivation  0.6592  0.0659  0.5345  0.6587  0.7900
##  
## N = 32  burnin = 2000  method = MCMC  family = poisson  
## iterations = 12000  step = 10
# Look at the samples from the Bayesian model
plot(samples(bayes_flu))

# You've fitted a non-spatial GLM with BayesX
# You can include a spatially correlated term based on the adjacency structure by adding a term to the formula specifying a spatially correlated model
# Use poly2nb() to compute the neighborhood structure of london to an nb object
# R2BayesX uses its own objects for the adjacency. Convert the nb object to a gra object
# The sx function specifies additional terms to bayesx. Create a term using the "spatial" basis and the gra object for the boroughs to define the map
# Print a summary of the model object. You should see a table of coefficients for the parametric part of the model as in the previous exercise, and then a table of "Smooth terms variance" with one row for the spatial term
# Note that since BayesX can fit many different forms in its sx terms, most of which, like the spatial model here, cannot be simply expressed with a parameter or two
# This table shows the variance of the random effects - for further explanation consult a text on random effects modelling

# Compute adjacency objects
borough_nb <- poly2nb(london)
borough_gra <- nb2gra(borough_nb)

# Fit spatial model
flu_spatial <- bayesx(
  Flu_OBS ~ HealthDeprivation + sx(i, bs = "spatial", map = borough_gra),
  offset = log(london$TOTAL_POP),
  family = "poisson", data = data.frame(london), 
  control = bayesx.control(seed = 17610407)
)
## Note: created new output directory 'C:/Users/Dave/AppData/Local/Temp/Rtmpa43JUL/bayesx1'!
# Summarize the model
summary(flu_spatial)
## Call:
## bayesx(formula = Flu_OBS ~ HealthDeprivation + sx(i, bs = "spatial", 
##     map = borough_gra), data = data.frame(london), offset = log(london$TOTAL_POP), 
##     control = bayesx.control(seed = 17610407), family = "poisson")
##  
## Fixed effects estimation results:
## 
## Parametric coefficients:
##                      Mean      Sd    2.5%     50%   97.5%
## (Intercept)       -9.2311  0.1246 -9.4826 -9.2298 -9.0148
## HealthDeprivation  0.7683  0.5844 -0.3749  0.7775  1.7922
## 
## Smooth terms variances:
##             Mean     Sd   2.5%    50%  97.5%    Min    Max
## sx(i):mrf 4.6381 1.6822 2.2851 4.3510 8.8104 1.6557 16.266
##  
## N = 32  burnin = 2000  method = MCMC  family = poisson  
## iterations = 12000  step = 10
# As with glm, you can get the fitted values and residuals from your model using the fitted and residuals functions. bayesx models are a bit more complex, since you have the linear predictor and terms from sx elements, such as the spatially correlated term
# The summary function will show you information for the linear model terms and the smoothing terms in two separate tables
# The spatial term is called "sx(i):mrf" - standing for "Markov Random Field"

# Bayesian analysis returns samples from a distribution for our S(x) term at each of the London boroughs
# The fitted function from bayesx models returns summary statistics for each borough
# You'll just look at the mean of that distribution for now

# The model from the BayesX output is available as flu_spatial.

# Summarise the model
summary(flu_spatial)
## Call:
## bayesx(formula = Flu_OBS ~ HealthDeprivation + sx(i, bs = "spatial", 
##     map = borough_gra), data = data.frame(london), offset = log(london$TOTAL_POP), 
##     control = bayesx.control(seed = 17610407), family = "poisson")
##  
## Fixed effects estimation results:
## 
## Parametric coefficients:
##                      Mean      Sd    2.5%     50%   97.5%
## (Intercept)       -9.2311  0.1246 -9.4826 -9.2298 -9.0148
## HealthDeprivation  0.7683  0.5844 -0.3749  0.7775  1.7922
## 
## Smooth terms variances:
##             Mean     Sd   2.5%    50%  97.5%    Min    Max
## sx(i):mrf 4.6381 1.6822 2.2851 4.3510 8.8104 1.6557 16.266
##  
## N = 32  burnin = 2000  method = MCMC  family = poisson  
## iterations = 12000  step = 10
# Map the fitted spatial term only
london$spatial <- fitted(flu_spatial, term = "sx(i):mrf")[, "Mean"]
sp::spplot(london, zcol = "spatial")

# Map the residuals
london$spatial_resid <- residuals(flu_spatial)[, "mu"]
sp::spplot(london, zcol = "spatial_resid")

# Test residuals for spatial correlation
moran.mc(london$spatial_resid, nb2listw(borough_nb), 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  london$spatial_resid 
## weights: nb2listw(borough_nb)  
## number of simulations + 1: 1000 
## 
## statistic = -0.26922, observed rank = 16, p-value = 0.984
## alternative hypothesis: greater

Chapter 4 - Geostatistics

Geostatistical data:

  • The term “geostatistics” refers to a specific type of data, frequently of a specific format
    • The locations of the data (events) are typically not interesting; they just happen to be where a monitor was set up
    • Need to consider the variable - continuous (e.g., radiation), count (e.g., bacteria), boolean (e.g., presence of virus), etc.
  • First, look for any large-scale trends, such as greater intensity to the northeast
    • Discontinuities? Need to add barriers if so
    • Always start by plotting a map, so that the type of model to create is based on EDA

Variogram:

  • Objective is often to look at similarity of items relative to distance between items (variogram shows y=difference, x=distance)
    • Typically, the distance is the Pythagorean distance
    • Typically, the difference in the items is 0.5 * (Ax - Bx)**2 where Ax and Bx are a key measurement, x, applied to each of A and B
    • The variogram can tend to become computationally intense as well as overly dense due to its inclusion of every possible pair-pair combination
    • Can instead bin the x-axis and plot mean for the y-axis, which then forms the traditional variogram
  • There are several interepretations that can be drawn from the variogram
    • If y increases with x, then there is a spatial correlation (near points are more similar than distant points)
    • If the slope of y vs. x flattens out at a certain distance, then that distance may be part of an asymptote (distances beyond the critical distance no longer drive extra differences)
    • The “nugget” variance is frequently defined as what you would expect to see if two observations are at zero distance from each either
  • The Matern class of model is frequently favored for putting a functional form to the variogram

Kriging predictions:

  • Kriging is the study of making predictions based on variograms and other geostatistical data
    • The basic maths are based on matrix multiplications, and are implemented in the gstat package
    • Outputs include the prediction and the variance (uncertainty) associated with the prediction
    • Can also look at various probabilities of exceedance given certain fixed parameters
  • Can predict at just a single location, or over an entire grid

Automatic kriging:

  • Sometimes need to bound the predictions (e.g., percentages between 0-1)
  • May have different gradients by direction due to slopes of terrain or the like
  • Kriging remains the best model currently available, despite some of the challenges and drawbacks
  • Can automatically run krigining using the automap::autoKrige() command
    • Requires that variable can be treated as Gaussian
    • Need to check variogram to see that it is reasonable
    • Useful one-step tool for getting started on analysis

Wrap up:

  • Spatial randomness
  • Point patterns
  • Aereal data
  • Geostatistics

Example code includes:

# Your job is to study the acidity (pH) of some Canadian survey data. The survey measurements are loaded into a spatial data object called ca_geo
# ca_geo has been pre-defined
ca_geo <- readRDS("./RInputFiles/ca_geo.RDS")
summary(ca_geo)
## Object of class SpatialPointsDataFrame
## Coordinates:
##         min       max
## x  542608.7  714269.2
## y 5541290.4 5652558.9
## Is projected: TRUE 
## proj4string :
## [+init=epsg:32609 +proj=utm +zone=9 +datum=WGS84 +units=m +no_defs
## +ellps=WGS84 +towgs84=0,0,0]
## Number of points: 1140
## Data attributes:
##           ID            Elev             pH              Zn        
##  102I881003:   1   Min.   :  5.0   Min.   :3.900   Min.   :  1.00  
##  102I881004:   1   1st Qu.: 20.0   1st Qu.:6.100   1st Qu.: 40.00  
##  102I881005:   1   Median :110.0   Median :6.600   Median : 62.00  
##  102I881006:   1   Mean   :183.6   Mean   :6.531   Mean   : 72.34  
##  102I881007:   1   3rd Qu.:310.0   3rd Qu.:7.000   3rd Qu.: 88.00  
##  102I881008:   1   Max.   :914.0   Max.   :8.700   Max.   :510.00  
##  (Other)   :1134   NA's   :9       NA's   :33                      
##        Cu                Pb                Ni               Co       
##  Min.   :   1.00   Min.   :  1.000   Min.   :  1.00   Min.   : 1.00  
##  1st Qu.:  21.00   1st Qu.:  1.000   1st Qu.:  7.00   1st Qu.:11.00  
##  Median :  37.00   Median :  1.000   Median : 20.00   Median :19.00  
##  Mean   :  57.45   Mean   :  2.975   Mean   : 27.85   Mean   :20.16  
##  3rd Qu.:  76.00   3rd Qu.:  3.000   3rd Qu.: 37.00   3rd Qu.:27.00  
##  Max.   :2950.00   Max.   :195.000   Max.   :340.00   Max.   :77.00  
##                                                                      
##        Ag               Mn               Fe               Mo        
##  Min.   :0.1000   Min.   :   2.0   Min.   : 0.010   Min.   : 1.000  
##  1st Qu.:0.1000   1st Qu.: 490.0   1st Qu.: 4.000   1st Qu.: 1.000  
##  Median :0.1000   Median : 820.0   Median : 5.100   Median : 1.000  
##  Mean   :0.1146   Mean   : 959.5   Mean   : 5.168   Mean   : 1.654  
##  3rd Qu.:0.1000   3rd Qu.:1200.0   3rd Qu.: 6.200   3rd Qu.: 2.000  
##  Max.   :7.9000   Max.   :9700.0   Max.   :24.000   Max.   :46.000  
##                                                                     
##        U               W               Sn                Hg       
##  Min.   :-1.00   Min.   :-1.00   Min.   :  1.000   Min.   :    5  
##  1st Qu.: 0.70   1st Qu.: 1.00   1st Qu.:  1.000   1st Qu.:   60  
##  Median : 1.10   Median : 1.00   Median :  1.000   Median :   80  
##  Mean   : 1.36   Mean   : 1.14   Mean   :  1.123   Mean   :  232  
##  3rd Qu.: 1.70   3rd Qu.: 1.00   3rd Qu.:  1.000   3rd Qu.:  120  
##  Max.   : 9.10   Max.   :32.00   Max.   :140.000   Max.   :20000  
##                                                                   
##        As               Sb                Ba             Cd        
##  Min.   :  1.00   Min.   : 0.1000   Min.   :   5   Min.   : 0.100  
##  1st Qu.:  5.00   1st Qu.: 0.1000   1st Qu.: 200   1st Qu.: 0.100  
##  Median :  6.00   Median : 0.1000   Median : 300   Median : 0.100  
##  Mean   : 10.95   Mean   : 0.2411   Mean   : 301   Mean   : 0.165  
##  3rd Qu.: 10.00   3rd Qu.: 0.1000   3rd Qu.: 390   3rd Qu.: 0.100  
##  Max.   :360.00   Max.   :15.0000   Max.   :1800   Max.   :14.800  
##                                                                    
##        V               Bi               Cr             LoI       
##  Min.   :  2.0   Min.   :0.1000   Min.   :  5.0   Min.   :-1.00  
##  1st Qu.:215.0   1st Qu.:0.1000   1st Qu.: 52.0   1st Qu.: 6.20  
##  Median :295.0   Median :0.1000   Median : 88.0   Median : 9.20  
##  Mean   :318.3   Mean   :0.1213   Mean   :114.1   Mean   :11.45  
##  3rd Qu.:410.0   3rd Qu.:0.1000   3rd Qu.:148.0   3rd Qu.:14.00  
##  Max.   :960.0   Max.   :4.2000   Max.   :860.0   Max.   :82.80  
##                                                                  
##        F               Au         
##  Min.   : 20.0   Min.   :   1.00  
##  1st Qu.:120.0   1st Qu.:   1.00  
##  Median :150.0   Median :   2.00  
##  Mean   :164.2   Mean   :  24.55  
##  3rd Qu.:200.0   3rd Qu.:   5.00  
##  Max.   :620.0   Max.   :5800.00  
## 
str(ca_geo, 1)
## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
# See what measurements are at each location
names(ca_geo)
##  [1] "ID"   "Elev" "pH"   "Zn"   "Cu"   "Pb"   "Ni"   "Co"   "Ag"   "Mn"  
## [11] "Fe"   "Mo"   "U"    "W"    "Sn"   "Hg"   "As"   "Sb"   "Ba"   "Cd"  
## [21] "V"    "Bi"   "Cr"   "LoI"  "F"    "Au"
# Get a summary of the acidity (pH) values
summary(ca_geo$pH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   3.900   6.100   6.600   6.531   7.000   8.700      33
# Look at the distribution
hist(ca_geo$pH)

# Make a vector that is TRUE for the missing data
miss <- is.na(ca_geo$pH)
table(miss)
## miss
## FALSE  TRUE 
##  1107    33
# Plot a map of acidity
spplot(ca_geo[!miss, ], "pH")

# The acidity data shows pH broadly increasing from north-east to south-west. Fitting a linear model with the coordinates as covariates will interpolate a flat plane through the values
# ca_geo has been pre-defined
str(ca_geo, 1)
## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
# Are they called lat-long, up-down, or what?
coordnames(ca_geo)
## [1] "x" "y"
# Complete the formula
m_trend <- lm(pH ~ x + y, as.data.frame(ca_geo))

# Check the coefficients
summary(m_trend)
## 
## Call:
## lm(formula = pH ~ x + y, data = as.data.frame(ca_geo))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.83561 -0.32091 -0.00761  0.33188  2.06249 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.358e+01  3.002e+00   27.84   <2e-16 ***
## x           -5.691e-06  3.483e-07  -16.34   <2e-16 ***
## y           -1.313e-05  5.319e-07  -24.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5299 on 1104 degrees of freedom
##   (33 observations deleted due to missingness)
## Multiple R-squared:  0.4237, Adjusted R-squared:  0.4227 
## F-statistic: 405.9 on 2 and 1104 DF,  p-value: < 2.2e-16
# Your next task is to compute the pH at the locations that have missing data in the source. You can use the predict() function on the fitted model from the previous exercise for this
# ca_geo, miss, m_trend have been pre-defined
# ls.str()

# Make a vector that is TRUE for the missing data
miss <- is.na(ca_geo$pH)

# Create a data frame of missing data
ca_geo_miss <- as.data.frame(ca_geo)[miss, ]

# Predict pH for the missing data
predictions <- predict(m_trend, newdata = ca_geo_miss, se.fit = TRUE)

# Compute the exceedence probability
pAlkaline <- 1 - pnorm(7, mean = predictions$fit, sd = predictions$se.fit)
hist(pAlkaline)

# You can use the gstat package to plot variogram clouds and the variograms from data. Recall:
# The variogram cloud shows the differences of the measurements against distance for all pairs of data points
# The binned variogram divides the cloud into distance bins and computes the average difference within each bin
# The y-range of the binned variogram is always much smaller than the variogram cloud because the cloud includes the full range of values that go into computing the mean for the binned variogram

# The acidity survey data, ca_geo and the missing value index, miss have been pre-defined

# The gstat variogram() function uses the cloud argument to plot a variogram cloud - the default cloud parameter is FALSE

# ca_geo, miss have been pre-defined
# ls.str()

# Make a cloud from the non-missing data up to 10km
plot(gstat::variogram(pH ~ 1, ca_geo[!miss, ], cloud = TRUE, cutoff = 10000))

# Make a variogram of the non-missing data
plot(gstat::variogram(pH ~ 1, ca_geo[!miss, ]))

# You might imagine that if soil at a particular point is alkaline, then soil one metre away is likely to be alkaline too
# But can you say the same thing about soil one kilometre away, or ten kilometres, or one hundred kilometres?

# The shape of the previous variogram tells you there is a large-scale trend in the data
# You can fit a variogram considering this trend with gstat
# This variogram should flatten out, indicating there is no more spatial correlation after a certain distance with the trend taken into account

# ca_geo, miss have been pre-defined
# ls.str()

# See what coordinates are called
coordnames(ca_geo)
## [1] "x" "y"
# The pH depends on the coordinates
ph_vgm <- gstat::variogram(pH ~ x + y, ca_geo[!miss, ])
plot(ph_vgm)

# Next you'll fit a model to your variogram
# The gstat function fit.variogram() does this
# You need to give it some initial values as a starting point for the optimization algorithm to fit a better model

# The sill is the the upper limit of the model
# That is, the long-range largest value, ignoring any outliers

# A variogram has been plotted for you, and ph_vgm has been pre-defined
# Estimate some parameters by eyeballing the plot
# The nugget is the value of the semivariance at zero distance.
# The partial sill, psill is the difference between the sill and the nugget.
# Set the range to the distance at which the variogram has got about half way between the nugget and the sill

# Fit a variogram model by calling fit.variogram()
# The second argument should take the parameters you estimated, wrapped in a call to vgm()

# ca_geo, miss, ph_vgm have been pre-defined
# ls.str()

# Eyeball the variogram and estimate the initial parameters
nugget <- 0.16
psill <- 0.15
range <- 10000

# Fit the variogram
v_model <- gstat::fit.variogram(
  ph_vgm, 
  model = gstat::vgm(
    model = "Ste",
    nugget = nugget,
    psill = psill,
    range = range,
    kappa = 0.5
  )
)

# Show the fitted variogram on top of the binned variogram
plot(ph_vgm, model = v_model)

print(v_model)
##   model     psill    range kappa
## 1   Nug 0.1545116     0.00   0.0
## 2   Ste 0.1442007 14379.29   0.5
# The final part of geostatical estimation is kriging itself
# This is the application of the variogram along with the sample data points to produce estimates and uncertainties at new locations

# The computation of estimates and uncertainties, together with the assumption of a normal (Gaussian) response means you can compute any function of the estimates - for example the probability of a new location having alkaline soil

# The acidity survey data, ca_geo, the missing value index, miss, and the variogram model, v_model, have been pre-defined

# ca_geo, miss, v_model have been pre-defined
# ls.str()

# Set the trend formula and the new data
km <- gstat::krige(pH ~ x + y, ca_geo[!miss, ], newdata = ca_geo[miss, ], model = v_model)
## [using universal kriging]
names(km)
## [1] "var1.pred" "var1.var"
# Plot the predicted values
spplot(km, "var1.pred")

# Compute the probability of alkaline samples, and map
km$pAlkaline <- 1 - pnorm(7, mean = km$var1.pred, sd = sqrt(km$var1.var))
spplot(km, "pAlkaline")

# You have been asked to produce an alkaline probability map over the study area
# To do this, you are going to do some kriging via the krige() function
# This requires a SpatialPixels object which will take a bit of data manipulation to create
# You start by defining a grid, creating points on that grid, cropping to the study region, and then finally converting to SpatialPixels
# On the way, you'll meet some new functions

# GridTopology() defines a rectangular grid. It takes three vectors of length two as inputs
# The first specifies the position of the bottom left corner of the grid
# The second specifies the width and height of each rectangle in the grid, and the third specifies the number of rectangles in each direction

# To ensure that the grid and the study area have the same coordinates, some housekeeping is involved
# SpatialPoints() converts the points to a coordinate reference system (CRS), or projection (different packages use different terminology for the same concept)
# The CRS is created by wrapping the study area in projection(), then in CRS()
# For the purpose of this exercise, you don't need to worry about exactly what these functions do, only that this data manipulation is necessary to align the grid and the study area

# Now that you have that alignment, crop(), as the name suggests, crops the grid to the study area
# Finally, SpatialPixels() converts the raster cropped gridpoints to the equivalent sp object

# The acidity survey data, ca_geo, the missing value index, miss, the variogram, vgm, and the variogram model, v_model, have been pre-defined
# A rough outline of the study area is in an object called geo_bounds

# ca_geo, geo_bounds have been pre-defined
# ls.str()

# Plot the polygon and points
geo_bounds <- readRDS("./RInputFiles/ca_geo_bounds.RDS")
plot(geo_bounds)
points(ca_geo)

# Find the corners of the boundary
bbox(geo_bounds)
##         min       max
## x  537853.4  719269.2
## y 5536290.4 5657400.9
# Define a 2.5km square grid over the polygon extent. The first parameter is
# the bottom left corner.
grid <- GridTopology(c(537853, 5536290), c(2500, 2500), c(72, 48))

# Create points with the same coordinate system as the boundary
gridpoints <- SpatialPoints(grid, proj4string = CRS(raster::projection(geo_bounds)))
plot(gridpoints)

# Crop out the points outside the boundary
cropped_gridpoints <- raster::crop(gridpoints, geo_bounds)
plot(cropped_gridpoints)

# Convert to SpatialPixels
spgrid <- SpatialPixels(cropped_gridpoints)
coordnames(spgrid) <- c("x", "y")
plot(spgrid)

# The spatial pixel grid of the region, spgrid, and the variogram model of pH, v_model have been pre-defined
# spgrid, v_model have been pre-defined
# ls.str()

# Do kriging predictions over the grid
ph_grid <- gstat::krige(pH ~ x + y, ca_geo[!miss, ], newdata = spgrid, model = v_model)
## [using universal kriging]
# Calc the probability of pH exceeding 7
ph_grid$pAlkaline <- 1 - pnorm(7, mean = ph_grid$var1.pred, sd = sqrt(ph_grid$var1.var))

# Map the probability of alkaline samples
spplot(ph_grid, zcol = "pAlkaline")

# The autoKrige() function in the automap package computes binned variograms, fits models, does model selection, and performs kriging by making multiple calls to the gstat functions you used previously
# It can be a great time-saver but you should always check the results carefully.

# autoKrige() can try several variogram model types
# In the example, you'll use a Matern variogram model, which is commonly used in soil and forestry analyses
# You can see a complete list of available models by calling vgm() with no arguments

# The acidity survey data, ca_geo, and the missing value index, miss, have been pre-defined

# ca_geo, miss are pre-defined
# ls.str()

# Kriging with linear trend, predicting over the missing points
ph_auto <- automap::autoKrige(
  pH ~ x + y,
  input_data = ca_geo[!miss, ],
  new_data = ca_geo[miss, ],
  model = "Mat"
)
## [using universal kriging]
# Plot the variogram, predictions, and standard error
plot(ph_auto)

# You can also use autoKrige() over the spgrid grid from the earlier exercise
# This brings together all the concepts that you've learned in the chapter
# That is, kriging is great for predicting missing data, plotting things on a grid is much clearer than plotting individual points, and automatic kriging is less hassle than manual kriging

# The acidity survey data, ca_geo, the missing value index, miss, the spatial pixel grid of the region, spgrid, the manual kriging grid model, ph_grid, and the variogram model of pH, v_model have been pre-defined

# ca_geo, miss, spgrid, ph_grid, v_model are pre-defined
# ls.str()

# Auto-run the kriging
ph_auto_grid <- automap::autoKrige(pH ~ x + y, input_data = ca_geo[!miss, ], new_data = spgrid)
## [using universal kriging]
# Remember predictions from manual kriging
plot(ph_grid)

# Plot predictions and variogram fit
plot(ph_auto_grid)

# Compare the variogram model to the earlier one
v_model
##   model     psill    range kappa
## 1   Nug 0.1545116     0.00   0.0
## 2   Ste 0.1442007 14379.29   0.5
ph_auto_grid$var_model
##   model     psill    range kappa
## 1   Nug 0.1846444     0.00     0
## 2   Ste 0.1092281 13085.13     2

Spatial Analysis in R with sf and raster

Chapter 1 - Vector and Raster Spatial Data in R

Reading vector and raster data into R:

  • Packages for this course will include sf, raster, ggplot2, tmap, and dplyr
  • Reading spatial data can be done in three key ways
    • vector data can be read using sf::st_read(“myFileName”) - will guess the file type based on the file extension
    • raster data can be read using raster::raster() fr single-band/layer or raster::brick() for multi-band/layer
    • vector data can be written using sf::st_write(myObject, “myFileName”)
    • raster data can be written using raster::writeRaster(myObject, “myFileName”)
  • These are very flexible functions, capable of reading many types of formats and making smart guesses about what is in the data

Getting to know your vector data:

  • One of the big innovations of the sf package is that everything is stored as a data frame
    • The data frame has metadata such as coordinate reference system that can be seen using head()
    • The geometry is stored in a list column, which is basically a data frame column that is itself a list
    • Geometry is a special type of list, a simple features list column
  • Can call plot() and see one plot for each layer
    • Can call plot(st_geometry()) to plot only the geometry without any of the data

Getting to know your raster data:

  • Rasters will be stored as objects of class RasterLayer or RasterBrick
    • The default print option for a raster object is to show all the metadata rather than the data frame
    • The extent() call will give you the minima and maxima for the data
    • The ncell() and nlayers() calls will provide the number of cells and the number of layers
    • The crs() will grab the coordinate reference system
  • Note that the raster() and brick() commands do NOT read in the full raster values by default
    • This is because rasters can be extremely large, and memory conservation is a priority
    • The inMemory() call will return a boolean as to whether the data have been read in to memory
    • The plot() and plotRGB() functions are usually good for a quick and dirty look at the data

Example code includes:

# Load the sf package
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.2.0, proj.4 4.9.3
# Read in the trees shapefile
trees <- st_read("./RInputFiles/ZIP Files/trees/trees.shp")
## Reading layer `trees' from data source `C:\Users\Dave\Documents\Personal\Learning\Coursera\RDirectory\RHomework\DataCamp\RInputFiles\ZIP Files\trees\trees.shp' using driver `ESRI Shapefile'
## Simple feature collection with 65217 features and 7 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -74.2546 ymin: 40.49894 xmax: -73.70078 ymax: 40.91165
## epsg (SRID):    4326
## proj4string:    +proj=longlat +ellps=WGS84 +no_defs
# Read in the neighborhood shapefile
neighborhoods <- st_read("./RInputFiles/ZIP Files/neighborhoods/neighborhoods.shp")
## Reading layer `neighborhoods' from data source `C:\Users\Dave\Documents\Personal\Learning\Coursera\RDirectory\RHomework\DataCamp\RInputFiles\ZIP Files\neighborhoods\neighborhoods.shp' using driver `ESRI Shapefile'
## Simple feature collection with 195 features and 5 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -74.25559 ymin: 40.49612 xmax: -73.70001 ymax: 40.91553
## epsg (SRID):    4326
## proj4string:    +proj=longlat +ellps=WGS84 +no_defs
# Read in the parks shapefile
parks <- st_read("./RInputFiles/ZIP Files/parks/parks.shp")
## Reading layer `parks' from data source `C:\Users\Dave\Documents\Personal\Learning\Coursera\RDirectory\RHomework\DataCamp\RInputFiles\ZIP Files\parks\parks.shp' using driver `ESRI Shapefile'
## Simple feature collection with 2008 features and 14 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -74.25609 ymin: 40.49449 xmax: -73.70905 ymax: 40.91133
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
# View the first few trees
head(trees)
## Simple feature collection with 6 features and 7 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -74.13116 ymin: 40.62351 xmax: -73.80057 ymax: 40.77393
## epsg (SRID):    4326
## proj4string:    +proj=longlat +ellps=WGS84 +no_defs
##   tree_id  nta longitude latitude stump_diam      species      boroname
## 1  558423 QN76 -73.80057 40.67035          0  honeylocust        Queens
## 2  286191 MN32 -73.95422 40.77393          0 Callery pear     Manhattan
## 3  257044 QN70 -73.92309 40.76196          0  Chinese elm        Queens
## 4  603262 BK09 -73.99866 40.69312          0       cherry      Brooklyn
## 5   41769 SI22 -74.11773 40.63166          0       cherry Staten Island
## 6   24024 SI07 -74.13116 40.62351          0    red maple Staten Island
##                     geometry
## 1 POINT (-73.80057 40.67035)
## 2 POINT (-73.95422 40.77393)
## 3 POINT (-73.92309 40.76196)
## 4 POINT (-73.99866 40.69312)
## 5 POINT (-74.11773 40.63166)
## 6 POINT (-74.13116 40.62351)
# The term "raster" refers to gridded data that can include satellite imagery, aerial photographs (like orthophotos) and other types
# In R, raster data can be handled using the raster package created by Robert J. Hijmans

# When working with raster data, one of the most important things to keep in mind is that the raw data can be what is known as "single-band" or "multi-band" and these are handled a little differently in R
# Single-band rasters are the simplest, these have a single layer of raster values -- a classic example would be an elevation raster where each cell value represents the elevation at that location
# Multi-band rasters will have more than one layer. An example is a color aerial photo in which there would be one band each representing red, green or blue light.

# Load the raster package
library(raster)
## Loading required package: sp
## 
## Attaching package: 'raster'
## The following objects are masked from 'package:spatstat':
## 
##     area, rotate, shift
## The following object is masked from 'package:nlme':
## 
##     getData
## The following object is masked from 'package:dplyr':
## 
##     select
# Read in the tree canopy single-band raster
canopy <- raster("./RInputFiles/ZIP Files/canopy/canopy.tif")

# Read in the manhattan Landsat image multi-band raster
manhattan <- brick("./RInputFiles/ZIP Files/manhattan/manhattan.tif")

# Get the class for the new objects
class(canopy)
## [1] "RasterLayer"
## attr(,"package")
## [1] "raster"
class(manhattan)
## [1] "RasterBrick"
## attr(,"package")
## [1] "raster"
# Identify how many layers each object has
nlayers(canopy)
## [1] 1
nlayers(manhattan)
## [1] 3
# As mentioned in the video, spatial objects in sf are just data frames with some special properties
# This means that packages like dplyr can be used to manipulate sf objects
# In this exercise, you will use the dplyr functions select() to select or drop variables, filter() to filter the data and mutate() to add or alter columns
# Load the dplyr and sf packages
# library(dplyr)
# library(sf)

# Read in the trees shapefile (already read in above)
# trees <- st_read("trees.shp")

# Use filter() to limit to honey locust trees
honeylocust <- trees %>% filter(species == "honeylocust")

# Count the number of rows
nrow(honeylocust)
## [1] 6418
# Limit to tree_id and boroname variables
honeylocust_lim <- honeylocust %>% dplyr::select(tree_id, boroname) 

# Use head() to look at the first few records
head(honeylocust_lim)
## Simple feature collection with 6 features and 2 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -73.97524 ymin: 40.67035 xmax: -73.80057 ymax: 40.83136
## epsg (SRID):    4326
## proj4string:    +proj=longlat +ellps=WGS84 +no_defs
##   tree_id boroname                   geometry
## 1  558423   Queens POINT (-73.80057 40.67035)
## 2  548625 Brooklyn POINT (-73.97524 40.68575)
## 3  549439 Brooklyn POINT (-73.94137 40.67557)
## 4  181757 Brooklyn POINT (-73.89377 40.67169)
## 5  379387   Queens  POINT (-73.8221 40.69365)
## 6  383562    Bronx POINT (-73.90041 40.83136)
# In this exercise, you will convert the data frame to what's called a tibble with tibble::as_tibble() (Note that dplyr::tbl_df() is now deprecated)
# tibble is loaded in your workspace

# Create a standard, non-spatial data frame with one column
df <- data.frame(a = 1:3)

# Add a list column to your data frame
df$b <- list(1:4, 1:5, 1:10)

# Look at your data frame with head
head(df)
##   a                             b
## 1 1                    1, 2, 3, 4
## 2 2                 1, 2, 3, 4, 5
## 3 3 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
# Convert your data frame to a tibble and print on console
as_tibble(df)
## # A tibble: 3 x 2
##       a b         
##   <int> <list>    
## 1     1 <int [4]> 
## 2     2 <int [5]> 
## 3     3 <int [10]>
# Pull out the third observation from both columns individually
df$a[3]
## [1] 3
df$b[3]
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
# There are several functions in sf that allow you to access geometric information like area from your vector features
# For example, the functions st_area() and st_length() return the area and length of your features, respectively
# Note that the result of functions like st_area() and st_length() will not be a traditional vector
# Instead the result has a class of units which means the vector result is accompanied by metadata describing the object's units
# you need to either remove the units with unclass()
# or you need to convert val's class to units such as with units(val) <- units(result)

# sf and dplyr are loaded in your workspace

# Read in the parks shapefile (already read in above)
# parks <- st_read("parks.shp")

# Compute the areas of the parks
areas <- st_area(parks)

# Create a quick histogram of the areas using hist
hist(areas, xlim = c(0, 200000), breaks = 1000)

# Filter to parks greater than 30000 (square meters)
big_parks <- parks %>% filter(unclass(areas) > 30000)

# Plot just the geometry of big_parks
plot(st_geometry(big_parks))

# Plot the parks object using all defaults
plot(parks)
## Warning: plotting the first 9 out of 14 attributes; use max.plot = 14 to
## plot all

# Plot just the acres attribute of the parks data
plot(parks["acres"])

# Create a new object of just the parks geometry
parks_geo <- st_geometry(parks)

# Plot the geometry of the parks data
plot(parks_geo)


# Load the raster package
# library(raster)

# Read in the rasters (done previously)
# canopy <- raster("canopy.tif")
# manhattan <- brick("manhattan.tif")

# Get the extent of the canopy object
extent(canopy)
## class       : Extent 
## xmin        : 1793685 
## xmax        : 1869585 
## ymin        : 2141805 
## ymax        : 2210805
# Get the CRS of the manhattan object
crs(manhattan)
## CRS arguments:
##  +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84
## +towgs84=0,0,0
# Determine the number of grid cells in both raster objects
ncell(manhattan)
## [1] 619173
ncell(canopy)
## [1] 58190
# Raster data can be very big depending on the extent and resolution (grid size)
# In order to deal with this the raster() and brick() functions are designed to only read in the actual raster values as needed
# To show that this is true, you can use the inMemory() function on an object and it will return FALSE if the values are not in memory
# If you use the head() function, the raster package will read in only the values needed, not the full set of values
# The raster values will be read in by default if you perform spatial analysis operations that require it or you can read in the values from a raster manually with the function getValues()

graphics.off()

# Check if the data is in memory
inMemory(canopy)
## [1] FALSE
# Use head() to peak at the first few records
head(canopy)
##        1     2     3     4     5     6     7     8     9    10    11    12
## 1   0.00 19.35 47.88 17.17 54.27 70.93 81.18 84.23 88.86 87.17 82.27 81.65
## 2   0.00 10.65 58.61 28.77 51.19 53.65 85.29 88.81 89.00 84.59 79.00 87.18
## 3   0.00  0.00 17.26 28.72 49.04 43.84 76.13 83.78 88.30 76.47 84.44 69.35
## 4   0.00  0.96 23.81 64.48 38.24 36.16 79.26 87.02 83.80 70.21 34.33 16.14
## 5   0.00  6.97 38.18 81.83 48.02 52.84 71.18 87.21 76.72 72.90  7.12 47.38
## 6  14.89 31.42 34.17 51.29 85.26 70.05 74.99 83.52 83.98 75.74 41.93 84.72
## 7  65.06 37.87 30.91 23.92 35.21 53.85 85.32 85.59 85.63 76.34 77.72 81.97
## 8  68.95 43.38 37.51 22.02 27.54 72.25 84.80 86.20 86.14 84.44 82.56 67.32
## 9  58.23 33.00 43.03 12.07 19.64 76.00 76.35 76.53 83.94 85.48 83.76 41.86
## 10 46.31 53.63 23.67 10.73 48.16 60.86 63.47 69.98 61.79 55.78 34.47 49.32
##       13    14    15    16    17    18    19    20
## 1  77.95 80.72 64.98 80.05 66.04 34.45 28.03 54.67
## 2  73.62 84.57 72.14 83.08 72.75 13.29 33.37 42.57
## 3  66.02 83.02 72.60 77.81 60.89 49.96 27.29 41.00
## 4  72.36 82.51 72.86 78.84 70.87 30.37 47.14 55.99
## 5  75.38 87.18 74.90 81.76 60.04 24.07 45.37 52.69
## 6  85.36 84.52 65.53 68.28 59.58 52.38 31.21 33.98
## 7  65.10 63.87 48.16 30.90 39.24 29.74  5.78 14.39
## 8   6.58 51.03 21.22 40.49 29.76 22.16  6.76 46.05
## 9   9.47 23.15 50.56 44.56 19.28 32.92 52.94 43.42
## 10  8.55 21.59 55.36 54.16 45.76 47.89 59.27 47.48
# Use getValues() to read the values into a vector
vals <- getValues(canopy)

# Use hist() to create a histogram of the values
hist(vals)


# The raster package has added useful methods for plotting both single and multi-band rasters
# For single-band rasters or for a map of each layer in a multi-band raster you can simply use plot()
# If you have a multi-band raster with layers for red, green and blue light you can use the plotRGB() function to plot the raster layers together as a single image

# Plot the canopy raster (single raster)
plot(canopy)

# Plot the manhattan raster (as a single image for each layer)
plot(manhattan)

# Plot the manhattan raster as an image
plotRGB(manhattan)

# raster masks dplyr::select
detach("package:raster")

Chapter 2 - Preparing Layers for Spatial Analysis

Quick refresher on coordinate reference systems (CRS):

  • Places on the earth defined by lat/lon or x/y are based on complex representations of the globe to a plane
    • An unprojected CRS uses latitude and longitude and references the earth as a 3D object
    • A projected CRS uses X and Y coordinates as a 2D representation of the earth
    • Best practice is to include metadata referencing the CRS used, though this does not always happen
    • Both sf and raster will read the CRS metadata if it has been provided and attach it to the R objects
  • Can interpret the results of st_crs() as to what CRS has been used and whether it is projected
    • If the $projstring starts with “+proj=longlat . . .” then it is a projected CRS
  • If the CRS is missing in vector data, it can typically be added using either EPSG or proj4string
    • EPSG is a numeric representation of a CRS
    • proj4string is a full set of parameters spelled out in a string
    • Need to do background research to find out what CRS should have been applied
    • Once the proper CRS has been located, it can be added using st_crs() <- anEPSGNum or st(crs() <- aCRSString
  • Determining or adding the CRS to raster data is very similar
    • Use crs() rather than st_crs()
  • Since all of the objects in a plot need to be on a consistent CRS, it is possible to run the conversions from/to
    • st_transform(myVec, myEPSGNum) for vectors
    • projectRaster(myRaster, myCRSString) for rasters
    • projectRaster(myRaster, “+init=epsg:nnnnn”) where nnnnn is the EPSG numeric code is the proper format for raster CRS changes

Manipulating vector layers with dplyr:

  • The dplyr::count(myVector, myVar) %>% dplyr::arrange(dplyr::desc(n)) will create a multi-point object of all the geometries in the list column
    • Can always get rid of geometry with x <- st_set_geometry(x, NULL)
    • The dplyr::inner_join() can be used with the vector data, provided that there are appropriate matching keys
  • Vector data is often much more detailed than required for a plot or analysis
    • It is often valuable to simplify lines and polygons
    • Can run st_simplify(myVector, dTolerance=) where the units of dTolerance are as per the CRS and larger numbers mean more simplification

Converting sf objects into sp objects and coordinates:

  • The sp package has had a long life and is frequently used in packages; conversions between sp and sf is critical
  • Converting an sf object to sp can be done with the as() function
    • as(mySF, class=“Spatial”) # will convert to an sp object
  • Converting from an sp object to an sf object can be done with the st_as_sf() function
    • st_as_sf(mySP) # will convert to an sf object
  • Creating an sf object from a data frame and coordinate system can be done with the st_as_sf() function
    • st_as_sf(myDF, coords=c(“long”, “lat”), crs=) # will use myDF\(long and myDF\)lat to create the CRS
  • When writing to a CSV using st_write(), the CRS is not included by default
    • Can froce the CRS to be written using st_write(mySF, “myFile.csv”, layer_options = “GEOMETRY=AS_XY”)

Manipulating raster layers:

  • Can reduce raster resolution using aggregate() and can reclassify values using reclassify()
  • The aggregate() function helps reduce the size of a raster file by averaging across pixels to create larger pixels
    • aggregate(myRaster, fact=, fun=) # the fact is a numeric that is the factor of aggregation while fun is the function to use (much like apply FUN, but lower-case)
    • Note that the raster is aggregated by fact in both dimensions, for a total reduction of approximately fact ** 2
  • The reclassify() call can be used to eliminate missing values, make outliers in to missing values, and the like
    • reclassify(myRaster, rcl=aMatrix) # where aMatrix can have 2-3 columns
    • The 3-column matrix (such as cbind(1, 3, NA)) would mean classify values between 1 and 3 to NA
    • The 2-column matrix (such as cbind(1, 3)) would mean classify all values as 1 to 3
    • Each row of the matrix represents a rule - can have multiple rows

Example code includes:

library(sf)
library(raster)
## 
## Attaching package: 'raster'
## The following object is masked from 'package:qdapTools':
## 
##     shift
## The following object is masked from 'package:qdapRegex':
## 
##     bind
## The following object is masked from 'package:magrittr':
## 
##     extract
## The following object is masked from 'package:colorspace':
## 
##     RGB
## The following objects are masked from 'package:spatstat':
## 
##     area, rotate, shift
## The following object is masked from 'package:nlme':
## 
##     getData
## The following object is masked from 'package:dplyr':
## 
##     select
# In order to perform any spatial analysis with more than one layer, your layers should share the same coordinate reference system (CRS) and the first step is determining what coordinate reference system your data has
# To do this you can make use of the sf function st_crs() and the raster function crs()

# When the geographic data you read in with sf already has a CRS defined both sf and raster will recognize and retain it
# When the CRS is not defined you will need to define it yourself using either the EPSG number or the proj4string


# Determine the CRS for the neighborhoods and trees vector objects
st_crs(neighborhoods)
## Coordinate Reference System:
##   EPSG: 4326 
##   proj4string: "+proj=longlat +ellps=WGS84 +no_defs"
st_crs(trees)
## Coordinate Reference System:
##   EPSG: 4326 
##   proj4string: "+proj=longlat +ellps=WGS84 +no_defs"
# Assign the CRS to trees
crs_1 <- "+proj=longlat +ellps=WGS84 +no_defs"
st_crs(trees) <- crs_1

# Determine the CRS for the canopy and manhattan rasters
crs(canopy)
## CRS arguments:
##  +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0
## +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
crs(manhattan)
## CRS arguments:
##  +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84
## +towgs84=0,0,0
# Assign the CRS to manhattan
crs_2 <- "+proj=utm +zone=18 +ellps=GRS80 +datum=NAD83 +units=m +no_defs"
crs(manhattan) <- crs_2


# In this exercise you will transform (sometimes this is called "project") the objects so they share a single CRS
# It is generally best to perform spatial analysis with layers that have a projected CRS (and some functions require this)
# To determine if your object has a projected CRS you can look at the first part of the result from st_crs() or crs() -- if it begins with +proj=longlat then your CRS is unprojected

# Note that you will use method = "ngb" in your call to projectRaster() to prevent distortion in the manhattan image

# Get the CRS from the canopy object
the_crs <- crs(canopy, asText = TRUE)

# Project trees to match the CRS of canopy
trees_crs <- st_transform(trees, crs = the_crs)

# Project neighborhoods to match the CRS of canopy
neighborhoods_crs <- st_transform(neighborhoods, crs = the_crs)

# Project manhattan to match the CRS of canopy
manhattan_crs <- projectRaster(manhattan, crs = the_crs, method = "ngb")

# Look at the CRS to see if they match
st_crs(trees_crs)
## Coordinate Reference System:
##   No EPSG code
##   proj4string: "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
st_crs(neighborhoods_crs)
## Coordinate Reference System:
##   No EPSG code
##   proj4string: "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
crs(manhattan_crs)
## CRS arguments:
##  +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0
## +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
# If the layers do not share a common CRS they may not align on a plot
# To illustrate, in this exercise, you will initially create a plot with the plot() function and try to add two layers that do not share the same CRS
# You will then transform one layer's CRS to match the other and you will plot this with both the plot() function and functions from the tmap package.

# Note that for this exercise we returned all the layers to their original CRS and did not retain the changes you made in the last exercise

# With the plot() function you can plot multiple layers on the same map by calling plot() multiple times
# You'll need to add the argument add = TRUE to all calls to plot() after the first one and you need to run the code for all layers at once rather than line-by-line

# Plot canopy and neighborhoods (run both lines together)
# Do you see the neighborhoods?
plot(canopy)
plot(neighborhoods$geometry, add = TRUE)

# See if canopy and neighborhoods share a CRS
st_crs(neighborhoods)
## Coordinate Reference System:
##   EPSG: 4326 
##   proj4string: "+proj=longlat +ellps=WGS84 +no_defs"
crs(canopy)
## CRS arguments:
##  +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0
## +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
# Save the CRS of the canopy layer
the_crs <- crs(canopy, asText = TRUE)

# Transform the neighborhoods CRS to match canopy
neighborhoods_crs <- st_transform(neighborhoods, crs=the_crs)

# Re-run plotting code (run both lines together)
# Do the neighborhoods show up now?
plot(canopy)
plot(neighborhoods_crs$geometry, add = TRUE)

# Simply run the tmap code
tmap::tm_shape(canopy) + 
    tmap::tm_rgb() + 
    tmap::tm_shape(neighborhoods_crs) + 
    tmap::tm_polygons(alpha = 0.5)

# One of the great innovations of sf over sp is the use of data frames for storing spatial objects
# This allows you to slice and dice your spatial data in the same way you do for non-spatial data
# This means you can, for example, apply dplyr verbs directly to your sf object

# One important difference between dplyr with and without spatial data is that the resulting data frames will include the geometry variable unless you explicitly drop it
# If you want to force the geometry to be dropped you would use the sf function st_set_geometry() and you would set the geometry to NULL

# The packages sf and dplyr, and the object trees are loaded in your workspace

# Create a data frame of counts by species
species_counts <- count(trees, species)

# Arrange in descending order
species_counts_desc <- arrange(species_counts, desc(n)) 

# Use head to see if the geometry column is in the data frame
head(species_counts_desc)
## Simple feature collection with 6 features and 2 fields
## geometry type:  MULTIPOINT
## dimension:      XY
## bbox:           xmin: -74.25443 ymin: 40.49894 xmax: -73.70104 ymax: 40.91165
## epsg (SRID):    4326
## proj4string:    +proj=longlat +ellps=WGS84 +no_defs
## # A tibble: 6 x 3
##   species               n                       geometry
##   <fct>             <int>         <sf_geometry [degree]>
## 1 London planetree   8709 MULTIPOINT (-74.25408 40.50...
## 2 honeylocust        6418 MULTIPOINT (-74.25426 40.50...
## 3 Callery pear       5902 MULTIPOINT (-74.25443 40.50...
## 4 pin oak            5355 MULTIPOINT (-74.25329 40.50...
## 5 Norway maple       3373 MULTIPOINT (-74.25443 40.50...
## 6 littleleaf linden  3043 MULTIPOINT (-74.25032 40.51...
# Drop the geometry column
species_no_geometry <- st_set_geometry(species_counts_desc, NULL)

# Confirm the geometry column has been dropped
head(species_no_geometry)
## # A tibble: 6 x 2
##   species               n
##   <fct>             <int>
## 1 London planetree   8709
## 2 honeylocust        6418
## 3 Callery pear       5902
## 4 pin oak            5355
## 5 Norway maple       3373
## 6 littleleaf linden  3043
# In this exercise you will test joining spatial and non-spatial data. In particular, the trees data you have been working with has a full county name (the variable is called boroname) but does not have the county codes. The neighborhoods file has both a county name (the variable is called boro_name) and the county codes -- neighborhoods are nested within counties
# In this exercise, you will create a non-spatial data frame of county name and county code from the neighborhoods object
# Then you will join this data frame into the spatial trees object with inner_join()

# The packages sf and dplyr and the objects neighborhoods and trees are loaded in your workspace

# Limit to the fields boro_name, county_fip and boro_code
boro <- dplyr::select(neighborhoods, boro_name, county_fip, boro_code)

# Drop the geometry column
boro_no_geometry <- st_set_geometry(boro, NULL)

# Limit to distinct records
boro_distinct <- distinct(boro_no_geometry)

# Join the county detail into the trees object
trees_with_county <- inner_join(trees, boro_distinct, by = c("boroname" = "boro_name"))

# Confirm the new fields county_fip and boro_code exist
head(trees_with_county)
## Simple feature collection with 6 features and 9 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -74.13116 ymin: 40.62351 xmax: -73.80057 ymax: 40.77393
## epsg (SRID):    4326
## proj4string:    +proj=longlat +ellps=WGS84 +no_defs
##   tree_id  nta longitude latitude stump_diam      species      boroname
## 1  558423 QN76 -73.80057 40.67035          0  honeylocust        Queens
## 2  286191 MN32 -73.95422 40.77393          0 Callery pear     Manhattan
## 3  257044 QN70 -73.92309 40.76196          0  Chinese elm        Queens
## 4  603262 BK09 -73.99866 40.69312          0       cherry      Brooklyn
## 5   41769 SI22 -74.11773 40.63166          0       cherry Staten Island
## 6   24024 SI07 -74.13116 40.62351          0    red maple Staten Island
##   county_fip boro_code                   geometry
## 1        081         4 POINT (-73.80057 40.67035)
## 2        061         1 POINT (-73.95422 40.77393)
## 3        081         4 POINT (-73.92309 40.76196)
## 4        047         3 POINT (-73.99866 40.69312)
## 5        085         5 POINT (-74.11773 40.63166)
## 6        085         5 POINT (-74.13116 40.62351)
# In sf you can use the st_simplify() function to reduce line and polygon complexity
# In this exercise you will measure the size of objects before and after st_simplify() in two ways
# You will compute the size in megabytes using the handy object_size() function in the pryr package and you will count the number of vertices -- the number of points required to delineate a line or polygon

# The packages sf and pryr are loaded in your workspace
# Plot the neighborhoods geometry
plot(st_geometry(neighborhoods), col = "grey")

# Measure the size of the neighborhoods object
utils::object.size(neighborhoods)
## 1890408 bytes
# Compute the number of vertices in the neighborhoods object
pts_neighborhoods <- st_cast(neighborhoods$geometry, "MULTIPOINT")
cnt_neighborhoods <- sapply(pts_neighborhoods, length)
sum(cnt_neighborhoods)
## [1] 210736
# Simplify the neighborhoods object
neighborhoods_simple <- st_simplify(neighborhoods, 
                            preserveTopology = TRUE, 
                            dTolerance = 0.0025)
## Warning in st_simplify.sfc(st_geometry(x), preserveTopology, dTolerance):
## st_simplify does not correctly simplify longitude/latitude data, dTolerance
## needs to be in decimal degrees
# Measure the size of the neighborhoods_simple object
utils::object.size(neighborhoods_simple)
## 248448 bytes
# Compute the number of vertices in the neighborhoods_simple object
pts_neighborhoods_simple <- st_cast(neighborhoods_simple$geometry, "MULTIPOINT")
cnt_neighborhoods_simple <- sapply(pts_neighborhoods_simple, length)
sum(cnt_neighborhoods_simple)
## [1] 4764
# Plot the neighborhoods_simple object geometry
plot(st_geometry(neighborhoods_simple), col = "grey")

# Read in the trees data (done previously)
# trees <- st_read("trees.shp")

# Convert to Spatial class
trees_sp <- as(trees, Class = "Spatial")

# Confirm conversion, should be "SpatialPointsDataFrame"
class(trees_sp)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
# Convert back to sf
trees_sf <- st_as_sf(trees_sp)

# Confirm conversion
class(trees_sf)
## [1] "sf"         "data.frame"
# In order to convert a data frame of coordinates into an sf object you can make use of the st_as_sf() function you used in the previous exercise
# You can specify the coords argument with the names of the coordinate variables (with the X coordinate/longitude coordinate listed first) and, optionally, the crs argument if you know the CRS of your coordinates
# The CRS can be specified as a proj4 string or EPSG code

# If you want to convert your sf point objects to a data frame with coordinates, you can use the st_write() function with a 
# hidden argument (these are arguments associated with an external utility called GDAL and so they're not in the R help) to force sf to include the coordinates in the output file
# The argument you need is layer_options = "GEOMETRY=AS_XY"

# Read in the CSV (done previously)
# trees <- read.csv("trees.csv")

# Convert the data frame to an sf object
trees_sf <- st_as_sf(trees, coords = c("longitude", "latitude"), crs = 4326)

# Plot the geometry of the points
plot(st_geometry(trees_sf))

# Write the file out with coordinates
st_write(trees_sf, "./RInputFiles/new_trees.csv",  layer_options = "GEOMETRY=AS_XY", delete_dsn = TRUE)
## Deleting source `C:\Users\Dave\Documents\Personal\Learning\Coursera\RDirectory\RHomework\DataCamp\RInputFiles\new_trees.csv' using driver `CSV'
## Writing layer `new_trees' to data source `C:\Users\Dave\Documents\Personal\Learning\Coursera\RDirectory\RHomework\DataCamp\RInputFiles\new_trees.csv' using driver `CSV'
## options:        GEOMETRY=AS_XY 
## features:       65217
## fields:         7
## geometry type:  Point
# Read in the file you just created and check coordinates
new_trees <- read.csv("./RInputFiles/new_trees.csv")
head(new_trees)
##           X        Y tree_id  nta longitude latitude stump_diam
## 1 -73.80057 40.67035  558423 QN76 -73.80057 40.67035          0
## 2 -73.95422 40.77393  286191 MN32 -73.95422 40.77393          0
## 3 -73.92309 40.76196  257044 QN70 -73.92309 40.76196          0
## 4 -73.99866 40.69312  603262 BK09 -73.99866 40.69312          0
## 5 -74.11773 40.63166   41769 SI22 -74.11773 40.63166          0
## 6 -74.13116 40.62351   24024 SI07 -74.13116 40.62351          0
##        species      boroname
## 1  honeylocust        Queens
## 2 Callery pear     Manhattan
## 3  Chinese elm        Queens
## 4       cherry      Brooklyn
## 5       cherry Staten Island
## 6    red maple Staten Island
# Read in the canopy layer (done previously)
# canopy <- raster("canopy.tif")

# Plot the canopy raster
plot(canopy)

# Determine the raster resolution
res(canopy)
## [1] 300 300
# Determine the number of cells
ncell(canopy)
## [1] 58190
# Aggregate the raster
canopy_small <- aggregate(canopy, fact = 10)

# Plot the new canopy layer
plot(canopy_small)

# Determine the new raster resolution
res(canopy_small)
## [1] 3000 3000
# Determine the number of cells in the new raster
ncell(canopy_small)
## [1] 598
# Plot the canopy layer to see the values above 100
plot(canopy)

# Set up the matrix
vals <- cbind(100, 300, NA)

# Reclassify 
canopy_reclass <- reclassify(canopy, rcl = vals)

# Plot again and confirm that the legend stops at 100
plot(canopy_reclass)

# raster masks dplyr::select
detach("package:raster")

Chapter 3 - Conducting Spatial Analysis with sf and raster

Buffers and centroids:

  • Best practices are to use a projected coordinate reference system, and to use the same CRS for all layers of the analysis or plotting
  • Buffering is a common first step to use - can be run as st_buffer(myVector, mySize) # units of mySize are as per the units in the myVector file
    • plot(bufferData) # will plot circles of size mySize around the point in the myVector data
  • Calculating centroids is another common usage - can be run as st_centroid(myShapeFile)

Bounding boxes, dissolve features and create a convex hull:

  • Regions are typically defined either as 1) rectangular bounding boxes, or 2) tighter polygons with convex hulls
    • Can get the bounding box coordinates using st_bbox(myShapeFile)
    • Can create a bounding box using st_make_grid(myPoly, n=1) # the n=1 is vital and requests return of a bounding box (one-cell grid)
  • Features can be dissolved using st_union
    • For polygons, you dissolve multiple meatures in to a single feature
    • For points, you cluster multiple points in to a single multi-point
    • st_union(myPolygons) # will create a single polygon reflecting all the territory covered by any of the individual myPolygons
    • st_union(myPoints) # will create a multi-point out of the individual points
    • The value of the multi-point is in creating the convex hull (bounding boxes make no sense for a single point, but are OK for a multi-point)
  • Can create the convex hull using st_convex_hull(myMultiPoint)

Multi-layer geoprocessing and relationships:

  • Linking features from multiple layers (spatial join)
    • The st_join() is like the inner_join() except that the st_join() is merging on geography rather than on a key by variable
    • st_join(myCoreData, myAddData) - appears to make multiple rows for items in myCoreData if they happen to touch 2+ items in myAddData
  • Determining relationships between features from multiple layers (e.g., intersect, distance)
    • Can use st_intersects(a, b) to get all the members of b that are at least partially contained within a
    • Can use st_contains(a, b) to get all the members of b that are fully contained by a
    • Can use st_intersection(a, b) to get a full clipping
    • Can calculate distances using st_distance(a, b) - results in a matrix of distance from each element of a to each element of b

Geoprocessing with rasters:

  • The raster package has not yet been updated to handle sf objects, so for now, the sp class is used instead - convert using as(mySF, “Spatial”)
  • Can mask a raster using mask()
    • myMask = mask(myData, mask=) # the items passed to max= should be of appropriate class, such as sp
    • The myMask object, when plotted, will only show the items inside the requested mask=
  • Can crop a raster using crop()
    • myCrop = crop(myData, myCropSP) # crops to include all the polygons in myCropSP
    • Often beneficial to both crop (reduce bounding box size) and mask (see only relevant items)
  • Can extract values using the extract() function
    • With points, the raster values under each point are returned
    • With polygons, the fun option controls the return (fun=NULL will return all points) - (fun= will return a summary object)
  • Can run raster math using overlay()
    • Basic raster math function could be f <- function(r1, r2) r1 * r2
    • Can then run overlay(rasterA, rasterB, fun=f) to get the multiplication

Example code includes:

library(raster)
## 
## Attaching package: 'raster'
## The following object is masked from 'package:qdapTools':
## 
##     shift
## The following object is masked from 'package:qdapRegex':
## 
##     bind
## The following object is masked from 'package:magrittr':
## 
##     extract
## The following object is masked from 'package:colorspace':
## 
##     RGB
## The following objects are masked from 'package:spatstat':
## 
##     area, rotate, shift
## The following object is masked from 'package:nlme':
## 
##     getData
## The following object is masked from 'package:dplyr':
## 
##     select
# Computing buffers is a key spatial analysis skill and the resulting buffers have a wide range of uses like, for example, identifying the number of roads within one kilometer of a school 
# or computing the number of hazardous waste sites near sensitive natural areas

# Although, technically you can buffer data with unprojected coodinate reference systems, the buffer distance will be more meaningful with a projected CRS 
# so it is highly recommended that you transform unprojected data to a projected CRS before buffering

df <- data.frame(place=c("Empire State Building", "Museum of Natural History"), 
                 longitude=c(-73.98566, -73.97398), 
                 latitude=c(40.74844, 40.78132), 
                 stringsAsFactors = TRUE
                 )

# Review df
df
##                       place longitude latitude
## 1     Empire State Building -73.98566 40.74844
## 2 Museum of Natural History -73.97398 40.78132
# Convert the data frame to an sf object             
df_sf <- st_as_sf(df, coords = c("longitude", "latitude"), crs=4326)

# Transform the points to match the manhattan CRS
df_crs <- st_transform(df_sf, crs = crs(manhattan, asText = TRUE))

# Buffer the points
df_buf <- st_buffer(df_crs, dist = 1000)


# Plot the manhattan image (it is multi-band)
plotRGB(manhattan)
plot(st_geometry(df_buf), col = "firebrick", add = TRUE)
plot(st_geometry(df_crs), pch = 16, add = TRUE)

# Similar to buffering, computing polygon centroids is a bedrock geoprocessing task used to assign values and even to help with labeling maps. The function for this in sf is st_centroid()
# Also similar to buffering, centroid calculations should generally be performed on data with a projected coordinate reference system

# Read in the neighborhods shapefile (done previously)
# neighborhoods <- st_read("neighborhoods.shp")

# Project neighborhoods to match manhattan
neighborhoods_tf <- st_transform(neighborhoods, crs = 32618)

# Compute the neighborhood centroids
centroids <- st_centroid(neighborhoods_tf)

# Plot the neighborhood geometry
plot(st_geometry(neighborhoods_tf), col = "grey", border = "white")
plot(centroids$geometry, pch = 16, col = "firebrick", add = TRUE)

# You can compute bounding boxes around vector data using sf
# These can help you, for example, create polygons to clip layers to a common area for an analysis or identify regions of influence

# In the sf package, there is a function for extracting the bounding box coordinates, if that's all you need, this is st_bbox()
# More likely you'll want to create a new sf object (a polygon) from those coordinates and to do this sf provides the st_make_grid() function

# st_make_grid() can be used to make a multi-row and multi-column grid covering your input data but it can also be used to make a grid of just one cell (a bounding box)
# To do this, you need to specify the number of grid cells as n = 1

# Use filter() to limit to honey locust trees
beech <- trees %>% filter(species %in% c("European beech", "American beech"))
str(beech)
## Classes 'sf' and 'data.frame':   27 obs. of  8 variables:
##  $ tree_id   : num  182961 163271 221707 16250 183728 ...
##  $ nta       : Factor w/ 188 levels "BK09","BK17",..: 169 107 119 58 180 146 106 135 169 171 ...
##  $ longitude : num  -73.9 -74 -73.8 -73.8 -74.1 ...
##  $ latitude  : num  40.8 40.8 40.7 40.8 40.6 ...
##  $ stump_diam: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ species   : Factor w/ 132 levels "'Schubert' chokecherry",..: 50 50 2 50 2 2 2 50 2 2 ...
##  $ boroname  : Factor w/ 5 levels "Bronx","Brooklyn",..: 4 3 4 1 5 4 3 4 4 5 ...
##  $ geometry  :sfc_POINT of length 27; first list element: Classes 'XY', 'POINT', 'sfg'  num [1:2] -73.9 40.8
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA
##   ..- attr(*, "names")= chr  "tree_id" "nta" "longitude" "latitude" ...
# Plot the neighborhoods and beech trees
plot(st_geometry(neighborhoods), col = "grey", border = "white")
plot(beech$geometry, add = TRUE, pch = 16, col = "forestgreen")

# Compute the coordinates of the bounding box
st_bbox(beech)
##      xmin      ymin      xmax      ymax 
## -74.17746  40.54247 -73.70872  40.85696
# Create a bounding box polygon
beech_box <- st_make_grid(beech, n = 1)

# Plot the neighborhoods, add the beech trees and add the new box
plot(st_geometry(neighborhoods), col = "grey", border = "white")
plot(beech$geometry, add = TRUE, pch = 16, col = "forestgreen")
plot(beech_box, add = TRUE)

# In order to compute a tighter bounding box, a convex hull, around a set of points like the beech trees from the previous exercise you'll need to learn one more function first

# For points you don't want a convex hull around each point! This doesn't even make sense
# More likely you want to compute a convex hull around all your points
# If you have a set of points and you want to draw a convex hull around them you first need to bundle the points into a single MULTIPOINT feature and in order to do this you will use the dissolve function in sf called st_union()

# With polygons, st_union() will dissolve all the polygons into a single polygon representing the area where all the polygons overlap
# Your set of individual points will be dissolved/unioned into a single, MULTIPOINT feature that you can use for tasks like computing the convex hull

# Buffer the beech trees by 3000
beech_buffer <- st_buffer(beech, 0.025)
## Warning in st_buffer.sfc(st_geometry(x), dist, nQuadSegs): st_buffer does
## not correctly buffer longitude/latitude data
## dist is assumed to be in decimal degrees (arc_degrees).
# Limit the object to just geometry
beech_buffers <- st_geometry(beech_buffer)

# Compute the number of features in beech_buffer
length(beech_buffers)
## [1] 27
# Plot the tree buffers
plot(beech_buffers)

# Dissolve the buffers
beech_buf_union <- st_union(beech_buffers)

# Compute the number of features in beech_buf_union
length(beech_buf_union)
## [1] 1
# Plot the dissolved buffers
plot(beech_buf_union)

# A more precise bounding polygon is sometimes needed, one that fits your data more neatly
# For this, you can use the st_convex_hull() function
# Note that st_convex_hull() will compute a tight box around each one of your features individually so if you want to create a convex hull around a 
# group of features you'll need to use st_union() to combine individual features into a single multi-feature

# Look at the data frame to see the type of geometry
head(beech)
## Simple feature collection with 6 features and 7 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -74.12843 ymin: 40.56829 xmax: -73.71567 ymax: 40.84684
## epsg (SRID):    4326
## proj4string:    +proj=longlat +ellps=WGS84 +no_defs
##   tree_id  nta longitude latitude stump_diam        species      boroname
## 1  182961 QN72 -73.90401 40.76897          0 European beech        Queens
## 2  163271 MN31 -73.95473 40.77224          0 European beech     Manhattan
## 3  221707 QN06 -73.79034 40.72523          0 American beech        Queens
## 4   16250 BX10 -73.78911 40.84684          0 European beech         Bronx
## 5  183728 SI25 -74.12843 40.56829          0 American beech Staten Island
## 6  591657 QN43 -73.71567 40.73604          0 American beech        Queens
##                     geometry
## 1 POINT (-73.90401 40.76897)
## 2 POINT (-73.95473 40.77224)
## 3 POINT (-73.79034 40.72523)
## 4 POINT (-73.78911 40.84684)
## 5 POINT (-74.12843 40.56829)
## 6 POINT (-73.71567 40.73604)
# Convert the points to a single multi-point
beech1 <- st_union(beech)

# Look at the data frame to see the type of geometry
head(beech1)
## Geometry set for 1 feature 
## geometry type:  MULTIPOINT
## dimension:      XY
## bbox:           xmin: -74.17746 ymin: 40.54247 xmax: -73.70872 ymax: 40.85696
## epsg (SRID):    4326
## proj4string:    +proj=longlat +ellps=WGS84 +no_defs
## MULTIPOINT (-74.17746 40.54247, -74.13941 40.61...
# Confirm that we went from 17 features to 1 feature
length(beech)
## [1] 8
length(beech1)
## [1] 1
# Compute the tight bounding box
beech_hull <- st_convex_hull(beech1)

# Plot the points together with the hull
plot(beech_hull, col = "red")
plot(beech1, add = TRUE)

# For many analysis types you need to link geographies spatially
# For example, you want to know how many trees are in each neighborhood but you don't have a neighborhood attribute in the tree data
# The best way to do this is with a spatial join using st_join()

# Importantly, the st_join() function requires sf data frames as input and will not accept an object that is just sf geometry
# You can use the st_sf() function to convert sf geometry objects to an sf data frame (st_sf() is essentially the opposite of st_geometry())

# Plot the beech on top of the neighborhoods
plot(st_geometry(neighborhoods))
plot(beech$geometry, add = TRUE, pch = 16, col = "red")

# Determine whether beech has class data.frame
class(beech)
## [1] "sf"         "data.frame"
# Convert the beech geometry to a sf data frame
beech_df <- st_sf(beech)

# Confirm that beech now has the data.frame class
class(beech_df)
## [1] "sf"         "data.frame"
# Join the beech trees with the neighborhoods
beech_neigh <- st_join(beech_df, neighborhoods)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
# Confirm that beech_neigh has the neighborhood information
head(beech_neigh)
## Simple feature collection with 6 features and 12 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -74.12843 ymin: 40.56829 xmax: -73.71567 ymax: 40.84684
## epsg (SRID):    4326
## proj4string:    +proj=longlat +ellps=WGS84 +no_defs
##   tree_id  nta longitude latitude stump_diam        species      boroname
## 1  182961 QN72 -73.90401 40.76897          0 European beech        Queens
## 2  163271 MN31 -73.95473 40.77224          0 European beech     Manhattan
## 3  221707 QN06 -73.79034 40.72523          0 American beech        Queens
## 4   16250 BX10 -73.78911 40.84684          0 European beech         Bronx
## 5  183728 SI25 -74.12843 40.56829          0 American beech Staten Island
## 6  591657 QN43 -73.71567 40.73604          0 American beech        Queens
##   county_fip     boro_name ntacode                             ntaname
## 1        081        Queens    QN72                            Steinway
## 2        061     Manhattan    MN31         Lenox Hill-Roosevelt Island
## 3        081        Queens    QN06          Jamaica Estates-Holliswood
## 4        005         Bronx    BX10 Pelham Bay-Country Club-City Island
## 5        085 Staten Island    SI25               Oakwood-Oakwood Beach
## 6        081        Queens    QN43                           Bellerose
##   boro_code                   geometry
## 1         4 POINT (-73.90401 40.76897)
## 2         1 POINT (-73.95473 40.77224)
## 3         4 POINT (-73.79034 40.72523)
## 4         2 POINT (-73.78911 40.84684)
## 5         5 POINT (-74.12843 40.56829)
## 6         4 POINT (-73.71567 40.73604)
# In this exercise you will determine which neighborhoods are at least partly within 2000 meters of the Empire State Building with st_intersects() 
# and those that are completely within 2000 meters of the Empire State Building using st_contains()
# You will then use the st_intersection() function (notice the slight difference in function name!) to clip the neighborhoods to the buffer

# A note about the output of functions that test relationships between two sets of features
# The output of these and related functions is a special kind of list (with the class sgbp)
# For example, when using st_intersects(), the first element in the output can be accessed using [[1]], which shows polygons from the second polygon that intersect with the first polygon
# Likewise, [[2]] would show the polygons from from the first polygon that intersect with the second polygon

# Review df
df
##                       place longitude latitude
## 1     Empire State Building -73.98566 40.74844
## 2 Museum of Natural History -73.97398 40.78132
df_mod <- df %>% filter(place == "Empire State Building")
df_sf_mod <- st_as_sf(df_mod, coords = c("longitude", "latitude"), crs=4326)
df_crs_mod <- st_transform(df_sf_mod, crs = crs(manhattan, asText = TRUE))
buf_mod <- st_buffer(df_crs_mod, dist = 2000)
buf <- st_transform(buf_mod, "+proj=longlat +ellps=WGS84 +no_defs")


# Identify neighborhoods that intersect with the buffer
neighborhoods_int <- st_intersects(buf, neighborhoods)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
# Identify neighborhoods contained by the buffer
neighborhoods_cont <- st_contains(buf, neighborhoods)
## although coordinates are longitude/latitude, st_contains assumes that they are planar
# Get the indexes of which neighborhoods intersect
# and are contained by the buffer
int <- neighborhoods_int[[1]]
cont <- neighborhoods_cont[[1]]

# Get the names of the names of neighborhoods in buffer
neighborhoods$ntaname[int]
## [1] Clinton                                   
## [2] Midtown-Midtown South                     
## [3] Turtle Bay-East Midtown                   
## [4] Murray Hill-Kips Bay                      
## [5] Gramercy                                  
## [6] Hudson Yards-Chelsea-Flatiron-Union Square
## [7] West Village                              
## [8] Stuyvesant Town-Cooper Village            
## [9] East Village                              
## 195 Levels: Airport ... Yorkville
# Clip the neighborhood layer by the buffer (ignore the warning)
neighborhoods_clip <- st_intersection(buf, neighborhoods)
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries
# Plot the geometry of the clipped neighborhoods
plot(st_geometry(neighborhoods_clip), col = "red")
plot(neighborhoods[cont,]$geometry, add = TRUE, col = "yellow")

# Of course, measuring distance between feature sets is a component of spatial analysis 101 -- a core skill for any analyst
# There are several functions in base R as well as in the packages rgeos and geosphere to compute distances, but the st_distance() function from sf 
# provides a useful feature-to-feature distance matrix as output and can be used for most distance calculation needs

# In this exercise you'll measure the distance from the Empire State Building to all the parks and identify the closest one

# Read in the parks object (done previously)
# parks <- st_read("parks.shp")

empire_state <- df_crs_mod
str(empire_state)
## Classes 'sf' and 'data.frame':   1 obs. of  2 variables:
##  $ place   : Factor w/ 2 levels "Empire State Building",..: 1
##  $ geometry:sfc_POINT of length 1; first list element: Classes 'XY', 'POINT', 'sfg'  num [1:2] 585632 4511327
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA
##   ..- attr(*, "names")= chr "place"
# Test whether the CRS match
st_crs(empire_state) == st_crs(parks)
## [1] FALSE
# Project parks to match empire state
parks_es <- st_transform(parks, crs = st_crs(empire_state))

# Compute the distance between empire_state and parks_es
d <- st_distance(empire_state, parks_es)

# Take a quick look at the result
head(d)
## Units: m
## [1]  3055.791 19969.336 22737.903 13846.358  4604.069 18541.779
# Find the index of the nearest park
nearest <- which.min(d)

# Identify the park that is nearest
parks_es[nearest, ]
## Simple feature collection with 1 feature and 14 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 585392 ymin: 4511320 xmax: 585418.2 ymax: 4511379
## epsg (SRID):    26918
## proj4string:    +proj=utm +zone=18 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
##                                                       location communityb
## 188 Broadway, Av of the Americas, bet. W. 32 St. and W. 33 St.        105
##     nys_senate            signname zipcode us_congres gispropnum borough
## 188         27 Greeley Square Park   10001         12       M032       M
##     waterfront nys_assemb councildis acres     typecatego      address
## 188         No         75          3 0.144 Triangle/Plaza 894 6 AVENUE
##                           geometry
## 188 MULTIPOLYGON (((585411 4511...
# Mask and crop are similar operations that allow you to limit your raster to a specific area of interest
# With mask() you essentially place your area of interest on top of the raster and any raster cells outside of the boundary are assigned NA values
# A reminder that currently the raster package does not support sf objects so they will need to be converted to Spatial objects with, for example, as(input, "Spatial").

# Project parks to match canopy
parks_cp <- st_transform(parks, crs = crs(canopy, asText = TRUE))

# Compute the area of the parks
areas <- st_area(parks_cp)

# Filter to parks with areas > 30000
parks_big <- filter(parks_cp, unclass(areas) > 30000)

# Plot the canopy raster
plot(canopy)

# Plot the geometry of parks_big
plot(st_geometry(parks_big))

# Convert parks to a Spatial object
parks_sp <- as(parks_big, "Spatial")

# Mask the canopy layer with parks_sp and save as canopy_mask
canopy_mask <- mask(canopy, mask = parks_sp)

# Plot canopy_mask -- this is a raster!
plot(canopy_mask)


# As you saw in the previous exercise with mask(), the raster extent is not changed
# If the extents of the input raster and the mask itself are different then they will still be different after running mask()
# In many cases, however, you will want your raster to share an extent with another layer and this is where crop() comes in handy
# With crop() you are cropping the raster so that the extent (the bounding box) of the raster matches the extent of the input crop layer
# But within the bounding box no masking is done (no raster cells are set to NA)

# In this exercise you will both mask and crop the NYC canopy layer based on the large parks and you'll compare
# You should notice that the masked raster includes a lot of NA values (there are the whitespace) and that the extent is the same as the original canopy layer
# With the cropped layer you should notice that the extent of the cropped canopy layer matches the extent of the large parks (essentially it's zoomed in)

# Convert the parks_big layer (this is preloaded, it has been limited to large parks and projected) to a Spatial object with as() -- call this parks_sp

# Convert the parks_big to a Spatial object
parks_sp <- as(parks_big, "Spatial")

# Mask the canopy with the large parks 
canopy_mask <- mask(canopy, mask = parks_sp)

# Plot the mask
plot(canopy_mask)

# Crop canopy with parks_sp
canopy_crop <- crop(canopy, parks_sp)

# Plot the cropped version and compare
plot(canopy_crop)

# Beyond simply masking and cropping you may want to know the actual cell values at locations of interest
# You might, for example, want to know the percentage canopy at your landmarks or within the large parks
# This is where the extract() function comes in handy

# Usefully, and you'll see this in a later analysis, you can feed extract() a function that will get applied to extracted cells
# For example, you can use extract() to extract raster values by neighborhood and with the fun = mean argument it will return an average cell value by neighborhood

# Similar to other raster functions, it is not yet set up to accept sf objects so you'll need to convert to a Spatial object

# Project the landmarks to match canopy
# landmarks_cp <- st_transform(landmarks, crs = crs(canopy, asText = TRUE))

# Convert the landmarks to a Spatial object
# landmarks_sp <- as(landmarks_cp, "Spatial")

# Extract the canopy values at the landmarks
# landmarks_ex <- extract(canopy, landmarks_sp)

# Look at the landmarks and extraction results
# landmarks_cp
# landmarks_ex


# You will now use the canopy layer and an "imperviousness" layer from the same source, the United States Geological Survey
# Imperviousness measures whether water can pass through a surface
# So a high percentage impervious surface might be a road that does not let water pass through while a low percentage impervious might be something like a lawn

# What you will do in this exercise is essentially identify the most urban locations by finding areas that have both a low percentage of tree canopy (< 20%) and high percentage of impervious (> 80%)
# To do this, we defined the function f to do the raster math for you

# Read in the canopy (already read in) and impervious layer
# canopy <- raster("canopy.tif")
impervious <- raster("./RInputFiles/ZIP Files/impervious/impervious.tif")

# Function f with 2 arguments and the raster math code
f <- function(rast1, rast2) {
  rast1 < 20 & rast2 > 80
}

# Do the overlay using f as fun
canopy_imperv_overlay <- overlay(canopy, impervious, fun = f)

# Plot the result (low tree canopy and high impervious areas)
plot(canopy_imperv_overlay)

# raster masks dplyr::select
detach("package:raster")

Chapter 4 - Combine Skills in Mini-Analysis

Compute tree density and average tree canopy by neighborhood:

  • Street tree census (each point is a tree) - note that Central Park will have none, so it is not a street
  • Tree canopy data (each grid cell is a percentage of canopy)
  • Question is whether these greenery measures are correlated by NYC neighborhood
    • Tree density by neighborhood (total trees divided by total area)
    • Average tree canopy by neighborhood (average all grid cells per neighborhood)

First look at results with ggplot2:

  • Note that ggplot2 and tmap, while more complex, make much nicer maps than plot()
    • The geom_sf() object is available in ggplot2, and does not require any aesthetic mappings if you just want to view the geometries
    • To make a chloropleth (areas filled by color), add aes(fill=) to the original ggplot call
    • Can improve the color scheme using scale_fill_gradient(low=, high=) # low and high each being quoted strings referencing an R color
  • As a reminder, it is generally best to plot using a projected CRS

Create final, polished maps with tmap:

  • The basic usage for tmap() is like ggplot(), and it begins with tm_shape() with layers added by plus signs
    • tm_polygons()
    • tm_rgb()
    • tm_lines()
    • tm_bubbles() or tm_dots()
  • Can add multiple data layers for the same data by just adding to the plus sign
  • To use a different layer, add a call to tm_shape(newData); so, for example, you might have
    • tm_shape(firstData) + tm(polygons() + tm_shape(secondData) + tm_borders()
  • For a chloropleth map, use the col=“myVar” option inside the specified layer
  • Can use tmap_arrange(maps…, nrow=) to get all the maps shown on the same pane, and with a specified number of nrows

Wrap up:

  • Geoprocessing allows for automated and integrated processing
  • The sf package makes use of data frames, which better integrated the process (similar to how raster works)
  • Raster data defaults to being left on disk, though with access available as needed
  • The CRS is particularly importanr
    • st_crs() and crs() to get or set the CRS
    • st_transform() or projectRaster() to change the CRS
  • Can then run spatial analyses for sf and raster projects, as well as plotting them
    • plot(), plotRGB(), ggplot2, tmap

Example code includes:

library(raster)
## 
## Attaching package: 'raster'
## The following object is masked from 'package:qdapTools':
## 
##     shift
## The following object is masked from 'package:qdapRegex':
## 
##     bind
## The following object is masked from 'package:magrittr':
## 
##     extract
## The following object is masked from 'package:colorspace':
## 
##     RGB
## The following objects are masked from 'package:spatstat':
## 
##     area, rotate, shift
## The following object is masked from 'package:nlme':
## 
##     getData
## The following object is masked from 'package:dplyr':
## 
##     select
# In order to compute tree density by neighborhood you need two things
# You will need to know the area of the neighborhoods, which you will compute in the next exercise
# And you need the tree counts by neighborhood which is the focus of this exercise

# You will produce counts of all trees by neighborhood in NYC and create a single data frame with a column for total trees
# The result should be a data frame with no geometry

# sf and dplyr are loaded in the workspace

# Compute the counts of all trees by hood (nta)
tree_counts <- count(trees, nta)

# Take a quick look
head(tree_counts)
## Simple feature collection with 6 features and 2 fields
## geometry type:  MULTIPOINT
## dimension:      XY
## bbox:           xmin: -74.00396 ymin: 40.57265 xmax: -73.92026 ymax: 40.70249
## epsg (SRID):    4326
## proj4string:    +proj=longlat +ellps=WGS84 +no_defs
## # A tibble: 6 x 3
##   nta       n                       geometry
##   <fct> <int>         <sf_geometry [degree]>
## 1 BK09    174 MULTIPOINT (-73.99901 40.69...
## 2 BK17    499 MULTIPOINT (-73.95982 40.58...
## 3 BK19    132 MULTIPOINT (-73.97331 40.57...
## 4 BK21    136 MULTIPOINT (-74.00396 40.58...
## 5 BK23     53 MULTIPOINT (-73.98038 40.57...
## 6 BK25    396 MULTIPOINT (-73.9718 40.607...
# Remove the geometry
tree_counts_no_geom <- st_set_geometry(tree_counts, NULL)

# Rename the n variable to tree_cnt
tree_counts_renamed <- rename(tree_counts_no_geom, tree_cnt = n)
  
# Create histograms of the total counts
hist(tree_counts_renamed$tree_cnt)

# We have the tree counts (from the previous exercise)
# In this exercise you will compute neighborhood areas, add them to the neighborhood sf object and then you'll join in the non-spatial tree counts data frame from the previous exercise

# Compute areas and unclass
areas <- unclass(st_area(neighborhoods))

# Add the areas to the neighborhoods object
neighborhoods_area <- mutate(neighborhoods, area = areas)

# Join neighborhoods and counts
neighborhoods_counts <- left_join(neighborhoods_area, tree_counts_renamed, by = c("ntacode"="nta"))
## Warning: Column `ntacode`/`nta` joining factors with different levels,
## coercing to character vector
# Replace NA values with 0
neighborhoods_counts <- mutate(neighborhoods_counts, 
                            tree_cnt = ifelse(is.na(tree_cnt), 
                                              0, tree_cnt))

# Compute the density
neighborhoods_counts <- mutate(neighborhoods_counts, 
                               tree_density = tree_cnt/area)


# In the previous exercises you computed tree density by neighborhood using tree counts
# In this exercise you will compute average tree canopy by neighborhood as a percentage so that we can compare if the results are similar

# Confirm that you have the neighborhood density results
head(neighborhoods_counts)
## Simple feature collection with 6 features and 8 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -74.00736 ymin: 40.61264 xmax: -73.77574 ymax: 40.8355
## epsg (SRID):    4326
## proj4string:    +proj=longlat +ellps=WGS84 +no_defs
##   county_fip boro_name ntacode            ntaname boro_code    area
## 1        047  Brooklyn    BK88       Borough Park         3 5017229
## 2        081    Queens    QN52      East Flushing         4 2736433
## 3        081    Queens    QN48         Auburndale         4 3173995
## 4        081    Queens    QN51        Murray Hill         4 4876380
## 5        081    Queens    QN27      East Elmhurst         4 1832715
## 6        005     Bronx    BX35 Morrisania-Melrose         2 1569317
##   tree_cnt tree_density                       geometry
## 1      565 0.0001126120 MULTIPOLYGON (((-73.97605 4...
## 2      295 0.0001078046 MULTIPOLYGON (((-73.79493 4...
## 3      507 0.0001597356 MULTIPOLYGON (((-73.77574 4...
## 4      732 0.0001501114 MULTIPOLYGON (((-73.80379 4...
## 5      211 0.0001151297 MULTIPOLYGON (((-73.8611 40...
## 6      214 0.0001363650 MULTIPOLYGON (((-73.89697 4...
# Transform the neighborhoods CRS to match the canopy layer
neighborhoods_crs <- st_transform(neighborhoods_counts, crs = crs(canopy, asText = TRUE))

# Convert neighborhoods object to a Spatial object
neighborhoods_sp <- as(neighborhoods_crs, "Spatial")

# Compute the mean of canopy values by neighborhood
canopy_neighborhoods <- extract(canopy_small, neighborhoods_sp, fun = mean)

# Add the mean canopy values to neighborhoods
neighborhoods_avg_canopy <- mutate(neighborhoods_counts, avg_canopy = as.vector(canopy_neighborhoods))


# Create a histogram of tree density (tree_density)
ggplot(neighborhoods_avg_canopy, aes(x = tree_density)) + 
  geom_histogram(color = "white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Create a histogram of average canopy (avg_canopy)
ggplot(neighborhoods_avg_canopy, aes(x = avg_canopy)) + 
  geom_histogram(color = "white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Create a scatter plot of tree_density vs avg_canopy
ggplot(neighborhoods_avg_canopy, aes(x = tree_density, y = avg_canopy)) + 
    geom_point() + 
    stat_smooth()
## `geom_smooth()` using method = 'loess'

# Compute the correlation between density and canopy
cor(neighborhoods_avg_canopy$tree_density, neighborhoods_avg_canopy$avg_canopy)
## [1] -0.08667411
# The geom_sf() function operates like any other layer in ggplot2 where you can link variables to aesthetics on the plot through the aes() function
# In a mapping context this might mean, for example, creating a choropleth map by color coding the polygons based on a variable
# If you leave off the aesthetic mapping geom_sf() will map the geometry alone

# Note: geom_sf() is still in the development version of ggplot2 on GitHub. If you want to use geom_sf() on your machine, you need to install the dev version
# devtools::install_github("tidyverse/ggplot2")

# Plot the tree density with default colors
# ggplot(neighborhoods_avg_canopy) + 
#   geom_sf(aes(fill = tree_density))

# Plot the tree canopy with default colors
# ggplot(neighborhoods) + 
#   geom_sf(aes(fill = avg_canopy))
  
# Plot the tree density using scale_fill_gradient()
# ggplot(neighborhoods) + 
#   geom_sf(aes(fill = tree_density)) + 
#   scale_fill_gradient(low = "#edf8e9", high = "#005a32")

# Plot the tree canopy using the scale_fill_gradient()
# ggplot(neighborhoods) + 
#   geom_sf(aes(fill = avg_canopy)) +
#   scale_fill_gradient(low = "#edf8e9", high = "#005a32")


# Create a simple map of neighborhoods
library(tmap)

tm_shape(neighborhoods_avg_canopy) + 
    tm_polygons()

# Create a color-coded map of neighborhood tree density
tm_shape(neighborhoods_avg_canopy) + 
    tm_polygons(col="tree_density")

# Style the tree density map
tm_shape(neighborhoods_avg_canopy) + 
    tm_polygons("tree_density", palette = "Greens", 
        style = "quantile", n = 7, 
        title = "Trees per sq. KM")

# Create a similar map of average tree canopy
tm_shape(neighborhoods_avg_canopy) + 
    tm_polygons("avg_canopy", palette = "Greens", 
        style = "quantile", n = 7, 
        title = "Average tree canopy (%)")

# Create a map of the manhattan aerial photo
tm_shape(manhattan) + 
    tm_rgb()

# Create a map of the neighborhood polygons
tm_shape(neighborhoods_avg_canopy) + 
    tm_borders(col = "black", lwd = 0.5, alpha = 0.5)

# Combine the aerial photo and neighborhoods into one map
map1 <- tm_shape(manhattan) + 
    tm_rgb() + 
    tm_shape(neighborhoods_avg_canopy) + 
    tm_borders(col = "black", lwd = 0.5, alpha = 0.5)

# Create the second map of tree measures (bbox causing errors . . . )
# map2 <- tm_shape(neighborhoods_avg_canopy, bbox = bbox(manhattan)) +
map2 <- tm_shape(neighborhoods_avg_canopy) +
        tm_polygons(c("tree_density", "avg_canopy"), 
        style = "quantile",
        palette = "Greens",
        title = c("Tree Density", "Average Tree Canopy"))

# Combine the two maps into one
tmap_arrange(map1, map2, asp = NA)

# raster masks dplyr::select
detach("package:raster")

Sentiment Analysis in R

Chapter 1 - Fast and Dirty - Polarity Scoring

Sentiment Analysis and Feelings:

  • Sentiment analysis is the process of extracting emotional intent from text - happy, sad, surprised, angry
  • Emotion is often important metadata - sentiment analysis helps to extract this from data
    • By contrast, bag-of-words is just gathering the words as vectors, without regard to their sentiment
    • Polarity is defined as the degree of positivity and negativity in text

Zipf’s Law, Number of Words, Subjectivity Lexicon:

  • A subjectivity lexicon is a predefined list of words associated with emotional contect such as positive/negative, frustration, joy, etc.
    • The qdap::polarity() uses a lexicon from hash_sentiment_huliu
  • The tidytext had a sentiments tibble with three different lexicons
    • NRC - words according to 8 emotions like “angry” or “joy” or Pos/Neg
    • Bing - words labelled positive or negative
    • AFINN - words scored from -5 to 5
  • There is also a lexicon package with a number of additional pre-defined lexicons
  • While the average person may know up to 50,000 words, the pre-defined lexicon typically have at most a few thousand words
    • Zipf’s Law - frequency of any given word is inversely proportional to its rank (e.g., third most frequent word is used about 33% as much as first-ranked word)
    • Principle of Least Effort - if there are multiple ways to achieve the same goal, people generally choose the easiest way (e.g., people tend to be lazy in how they express themselves)

Explore qdap - Polarity and Lexicon:

  • The default lexicon for polarity() comes from UIC researchers - roughly 7,000 words classified as positive or negative
    • For each word that polarity() identifies, it looks at the four words before and the two words after to form a context cluster
    • Therefore, removing stop words WILL change the polarity score, since the cluster will change
    • The words in the context cluster are classified as amplifier, negater, de-amplifier, etc.
    • The amplifiers and de-amplifier are considered valence shifters - words that effect the emotional context by way of additive or subtractive
    • The negator switches the polarity of the context cluster
    • Positive words default to +1 and negative words default to -1 (prior to any negation or amplification)
    • Amplifiers get a specified weight - for example, “very” is 0.8 if positive and -0.8 if negative
    • The final polarity score is divided by the square root of the number of words
  • Can adjust the lexicon to be channel specific, so that it works properly for the specific context needed

Example code includes:

# Call the libraries in a non-cached chunk
library(magrittr)
library(qdap)
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## 
## Attaching package: 'qdapRegex'
## The following object is masked from 'package:ggplot2':
## 
##     %+%
## The following object is masked from 'package:dplyr':
## 
##     explain
## Loading required package: qdapTools
## 
## Attaching package: 'qdapTools'
## The following object is masked from 'package:spatstat':
## 
##     shift
## The following object is masked from 'package:dplyr':
## 
##     id
## Loading required package: RColorBrewer
## 
## Attaching package: 'qdap'
## The following object is masked from 'package:magrittr':
## 
##     %>%
## The following object is masked from 'package:sf':
## 
##     %>%
## The following object is masked from 'package:dplyr':
## 
##     %>%
## The following object is masked from 'package:base':
## 
##     Filter
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:qdap':
## 
##     ngrams
## The following object is masked from 'package:ggplot2':
## 
##     annotate
## 
## Attaching package: 'tm'
## The following objects are masked from 'package:qdap':
## 
##     as.DocumentTermMatrix, as.TermDocumentMatrix

Followed by:

# We created text_df representing a conversation with person and text columns

# Use qdap's polarity() function to score text_df
# polarity() will accept a single character object or data frame with a grouping variable to calculate a positive or negative score

# In this example you will use the magrittr package's dollar pipe operator %$%
# The dollar sign forwards the data frame into polarity() and you declare a text column name or the text column and a grouping variable without quotes
# text_data_frame %$% polarity(text_column_name)

# To create an object with the dollar sign operator:
# polarity_object <- text_data_frame %$% 
#     polarity(text_column_name, grouping_column_name)

# More specifically, to make a quantitative judgement about the sentiment of some text, you need to give it a score
# A simple method is a positive or negative value related to a sentence, passage or a collection of documents called a corpus
# Scoring with positive or negative values only is called "polarity."
# A useful function for extracting polarity scores is counts() applied to the polarity object
# For a quick visual call plot() on the polarity() outcome

# From http://magrittr.tidyverse.org/
# Many functions accept a data argument, e.g. lm and aggregate, which is very useful in a pipeline where data is first processed and then passed into such a function
# There are also functions that do not have a data argument, for which it is useful to expose the variables in the data
# This is done with the %$% operator
# iris %>%
#   subset(Sepal.Length > mean(Sepal.Length)) %$%
#   cor(Sepal.Length, Sepal.Width)

library(magrittr)
library(qdap)

text_df <- data.frame(
    person=c('Nick', 'Jonathan', 'Martijn', 'Nicole', 'Nick', 'Jonathan', 'Martijn', 'Nicole'), 
    text=c('DataCamp courses are the best', 'I like talking to students', 'Other online data science curricula are boring.', 'What is for lunch?', 'DataCamp has lots of great content!', 'Students are passionate and are excited to learn', 'Other data science curriculum is hard to learn and difficult to understand', 'I think the food here is good.'), 
    stringsAsFactors=TRUE
)

# Examine the text data
text_df
##     person
## 1     Nick
## 2 Jonathan
## 3  Martijn
## 4   Nicole
## 5     Nick
## 6 Jonathan
## 7  Martijn
## 8   Nicole
##                                                                         text
## 1                                              DataCamp courses are the best
## 2                                                 I like talking to students
## 3                            Other online data science curricula are boring.
## 4                                                         What is for lunch?
## 5                                        DataCamp has lots of great content!
## 6                           Students are passionate and are excited to learn
## 7 Other data science curriculum is hard to learn and difficult to understand
## 8                                             I think the food here is good.
# Calc overall polarity score
text_df %$% qdap::polarity(text)
##   all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all               8          54        0.179       0.452              0.396
# Calc polarity score by person
(datacamp_conversation <- text_df %$% qdap::polarity(text, person))
##     person total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 Jonathan               2          13        0.577       0.184              3.141
## 2  Martijn               2          19       -0.478       0.141             -3.388
## 3     Nick               2          11        0.428       0.028             15.524
## 4   Nicole               2          11        0.189       0.267              0.707
# Counts table from datacamp_conversation
qdap::counts(datacamp_conversation)
##     person wc polarity           pos.words       neg.words                                                                   text.var
## 1     Nick  5    0.447                best               -                                              DataCamp courses are the best
## 2 Jonathan  5    0.447                like               -                                                 I like talking to students
## 3  Martijn  7   -0.378                   -          boring                            Other online data science curricula are boring.
## 4   Nicole  4    0.000                   -               -                                                         What is for lunch?
## 5     Nick  6    0.408               great               -                                        DataCamp has lots of great content!
## 6 Jonathan  8    0.707 passionate, excited               -                           Students are passionate and are excited to learn
## 7  Martijn 12   -0.577                   - hard, difficult Other data science curriculum is hard to learn and difficult to understand
## 8   Nicole  7    0.378                good               -                                             I think the food here is good.
# Plot the conversation polarity
plot(datacamp_conversation)
## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.

## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.

# In the Text Mining: Bag of Words course you learned that a corpus is a set of texts, and you studied some functions for preprocessing the text
# To recap, one way to create a corpus is with the functions below
# Even though this is a different course, sentiment analysis is part of text mining so a refresher can be helpful
# Turn a character vector into a text source using VectorSource().
# Turn a text source into a corpus using VCorpus().
# Remove unwanted characters from the corpus using cleaning functions like removePunctuation() and stripWhitespace() from tm, and replace_abbreviation() from qdap

# In this exercise a custom clean_corpus() function has been created using standard preprocessing functions for easier application
# clean_corpus() accepts the output of VCorpus() and applies cleaning functions. For example:
# processed_corpus <- clean_corpus(my_corpus)

library(tm)

clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "coffee"))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}

# Your R session has a text vector, tm_define, containing two small documents and the function clean_corpus().
tm_define <- c("Text mining is the process of distilling actionable insights from text.", "Sentiment analysis represents the set of tools to extract an author's feelings towards a subject.")

# clean_corpus(), tm_define are pre-defined
clean_corpus
## function(corpus){
##   corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
##   corpus <- tm_map(corpus, removePunctuation)
##   corpus <- tm_map(corpus, removeNumbers)
##   corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "coffee"))
##   corpus <- tm_map(corpus, content_transformer(tolower))
##   corpus <- tm_map(corpus, stripWhitespace)
##   return(corpus)
## }
tm_define
## [1] "Text mining is the process of distilling actionable insights from text."                          
## [2] "Sentiment analysis represents the set of tools to extract an author's feelings towards a subject."
# Create a VectorSource
tm_vector <- VectorSource(tm_define)

# Apply VCorpus
tm_corpus <- VCorpus(tm_vector)

# Examine the first document's contents
content(tm_corpus[[1]])
## [1] "Text mining is the process of distilling actionable insights from text."
# Clean the text
tm_clean <- clean_corpus(tm_corpus)

# Reexamine the contents of the first doc
content(tm_clean[[1]])
## [1] "text mining process distilling actionable insights text"
# Now let's create a Document Term Matrix (DTM). In a DTM
# Each row of the matrix represents a document.
# Each column is a unique word token.
# Values of the matrix correspond to an individual document's word usage

# The DTM is the basis for many bag of words analyses
# Later in the course, you will also use the related Term Document Matrix (TDM)
# This is the transpose; that is, columns represent documents and rows represent unique word tokens

# You should construct a DTM after cleaning the corpus (using clean_corpus())
# To do so, call DocumentTermMatrix() on the corpus object
# tm_dtm <- DocumentTermMatrix(tm_clean)

# If you need a more in-depth refresher check out the Text Mining: Bag of Words course
# Hopefully these two exercises have prepared you well enough to embark on your sentiment analysis journey!

# We've created a VCorpus() object called clean_text containing 1000 tweets mentioning coffee
# The tweets have been cleaned with the previously mentioned preprocessing steps and your goal is to create a DTM from it

# clean_text is pre-defined (do not have VCorpus)
# clean_text

# Create tf_dtm
# tf_dtm <- DocumentTermMatrix(clean_text)

# Create tf_dtm_m
# tf_dtm_m <- as.matrix(tf_dtm)

# Dimensions of DTM matrix
# dim(tf_dtm_m)

# Subset part of tf_dtm_m for comparison
# tf_dtm_m[16:20, 2975:2985]


# Although Zipf observed a steep and predictable decline in word usage you may not buy into Zipf's law
# You may be thinking "I know plenty of words, and have a distinctive vocabulary"
# That may be the case, but the same can't be said for most people!
# To prove it, let's construct a visual from 3 million tweets mentioning "#sb"
# Keep in mind that the visual doesn't follow Zipf's law perfectly, the tweets all mentioned the same hashtag so it is a bit skewed
# That said, the visual you will make follows a steep decline showing a small lexical diversity among the millions of tweets
# So there is some science behind using lexicons for natural language analysis!

# In this exercise, you will use the package metricsgraphics
# Although the author suggests using the pipe %>% operator, you will construct the graphic step-by-step to learn about the various aspects of the plot
# The main function of the package metricsgraphics is the mjs_plot() function which is the first step in creating a JavaScript plot
# Once you have that, you can add other layers on top of the plot

# An example metricsgraphics workflow without using the %>% operator is below
# metro_plot <- mjs_plot(data, x = x_axis_name, y = y_axis_name, show_rollover_text = FALSE)
# metro_plot <- mjs_line(metro_plot)
# metro_plot <- mjs_add_line(metro_plot, line_one_values)
# metro_plot <- mjs_add_legend(metro_plot, legend = c('names', 'more_names'))
# metro_plot


rawWords <- c('sb', 'rt', 'the', 'to', 'a', 'for', 'esurancesweepstakes', 'you', 'broncos', 'esurance', 'in', 'is', 'of', 'on', 'win', 'and', 'panthers', 'nfl', 'i', 'super', 'at', 'with', 'bowl', 'this', 'your', 'superbowl', 'it', 'are', 'keeppounding', 'that', 'be', 'will', 'k', 'game', 'amp', 'we', 'our', 'my', 'got', 'cam', 'https\205', 'big', 'if', 'but', 'from', 'just', 'time', 'now', 'all', 'have', 'up', 'who', 'out', 'show', 'sbfanvote', 'peyton', 'so', 'chance', 'was', 'why', 'watch', 'see', 'like', 'winning', 'not', 'commercial', 'get', 'by', 'coldplay', 'more', 'think', 'what', 'go', 'one', 'do', 'over', 'here', 'halftime', 'away', 'good', 'me', 'gaga', 'lady', 'i\222ve', 'how', 'httpstco\205', 'ready', 'manning', 'pepsihalftime', 'could', 'wearing', 'ad', 'during', 'its', 'about', 'beyonce', 'doritos', 'httpst\205', 'an', 'day', 'going', 'anthem', 'after', 'national', 'than', 'team', 'want', 'gonna', 'some', 'his', 'denver', 'best', 'ladygaga', 'can', 'im', 'pass', 'today', 'enter', 'socks', 'avosinspace', 'shoes', 'sandals', 'reporter', 'jeans', 'biggame', 'httpstcoqdraydnsb', 'brunomars', 'tomorrow', 'sweepstakes', 'check', 'when', 'avosfrommexico', 'beyonc\351', 'sunday', 'great', 'seo', 'mvp', 'performance', 'as', 'love', 'they', 'new', 'field', 'did', 'congrats', 'tmobile', 'still', 'no', 'drake', 'tonight', 'special', 'yougotcarriered', 'he', 'last', 'has', 'too', 'superbowlsunday', 'lets', 'make')
rawFreq <- c(1984423, 1700564, 1101899, 588803, 428598, 388390, 326464, 322154, 296673, 292468, 266847, 265392, 245718, 234509, 233618, 233157, 215919, 212620, 202765, 183808, 182673, 176209, 175996, 172636, 146487, 143345, 142812, 136649, 134436, 130056, 128878, 126930, 116187, 115213, 114805, 108680, 103023, 88247, 88099, 84442, 82291, 82116, 79843, 78986, 77616, 77562, 75405, 73245, 70581, 68565, 68325, 66217, 66030, 64489, 63026, 62986, 62878, 61111, 60982, 59658, 59629, 57424, 56911, 56585, 56455, 56182, 55496, 55237, 54729, 53962, 52840, 50489, 46303, 46216, 45832, 45569, 44364, 43338, 42667, 42008, 41743, 41566, 41473, 40003, 39888, 39808, 39421, 38575, 38498, 37085, 35345, 32997, 31292, 31018, 30832, 29258, 29183, 28980, 28908, 27361, 27283, 23367, 23183, 22575, 22456, 21964, 21095, 20530, 20213, 19514, 19428, 19115, 18887, 18483, 18120, 16901, 14239, 14110, 13475, 13424, 13329, 13326, 13304, 13231, 13221, 13194, 12641, 12225, 11635, 11502, 11362, 11341, 11293, 11102, 10986, 10660, 10637, 10331, 10136, 10040, 9963, 9745, 9616, 9495, 9468, 9397, 9384, 9368, 9284, 8914, 8732, 8719, 8697, 8629, 8536, 8379, 8316, 7977, 7970)
rawRank <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159)

sb_words <- data.frame(word=rawWords, freq=rawFreq, rank=rawRank)

# Examine sb_words
head(sb_words)
##   word    freq rank
## 1   sb 1984423    1
## 2   rt 1700564    2
## 3  the 1101899    3
## 4   to  588803    4
## 5    a  428598    5
## 6  for  388390    6
# Create expectations
# sb_words$expectations <- sb_words %$%
#   {freq / rank}

# Probably should be something more like this
sb_words$expectations <- sb_words %$% 
  {freq[1] / rank}

# Create metrics plot
sb_plot <- metricsgraphics::mjs_plot(sb_words, x = rank, y = freq, show_rollover_text = FALSE)

# Add 1st line
sb_plot <- metricsgraphics::mjs_line(sb_plot)

# Add 2nd line
sb_plot <- metricsgraphics::mjs_add_line(sb_plot, expectations)

# Add legend
sb_plot <- metricsgraphics::mjs_add_legend(sb_plot, legend = c("Frequency", "Expectation"))

# Display plot
sb_plot
# So far you have learned the basic components needed for assessing positive or negative intent in text
# Remember the following points so you can feel confident in your results.
# The subjectivity lexicon is a predefined list of words associated with emotions or positive/negative feelings.
# You don't have to list every word in a subjectivity lexicon because Zipf's law describes human expression.

# A quick way to get started is to use the polarity() function which has a built-in subjectivity lexicon

# The function scans the text to identify words in the lexicon
# It then creates a word group around the identified positive or negative subjectivity word
# Within the group valence shifters adjust the score
# Valence shifters are words that amplify or negate the emotional intent of the subjectivity word
# For example, "well known" is positive while "not well known" is negative
# Here "not" is a negating term and reverses the emotional intent of "well known."
# In contrast, "very well known" employs an amplifier increasing the positive intent

# The polarity() function then calculates a score using subjectivity terms, valence shifters and the total number of words in the passage
# This exercise demonstrates a simple polarity calculation
# In the next video we look under the hood of polarity() for more detail

# We've defined positive to denote a positive statement

# Example statements
positive <- "DataCamp courses are good for learning"

# Calculate polarity of both statements
(pos_score <- polarity(positive))
##   all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all               1           6        0.408          NA                 NA
# Get counts
(pos_counts <- counts(pos_score))
##   all wc polarity pos.words neg.words                               text.var
## 1 all  6    0.408      good         - DataCamp courses are good for learning
# Number of positive words
n_good <- length(pos_counts$pos.words[[1]])
  
# Total number of words
n_words <- pos_counts$wc
  
# Verify polarity score
n_good / sqrt(n_words)
## [1] 0.4082483
# Of course just positive and negative words aren't enough
# In this exercise you will learn about valence shifters which tell you about the author's emotional intent
# Previously you applied polarity() to text without valence shifters. In this example you will see amplifers and negating words in action

# Recall that an amplifying word adds 0.8 to a positive word in polarity() so the positive score becomes 1.8
# For negative words 0.8 is subtracted so the total becomes -1.8
# Then the score is divided by the square root of the total number of words

# Consider the following example from Frank Sinatra: "It was a very good year"
# "Good" equals 1 and "very" adds another 0.8
# So, 1.8/sqrt(6) results in 0.73 polarity

# A negating word such as "not" will inverse the subjectivity score
# Consider the following example from Bobby McFerrin: "Don't worry Be Happy"
# "worry is now 1 due to the negation "don't."
# Adding the "happy", +1, equals 2
# With 4 total words, 2 / sqrt(4) equals a polarity score of 1

conversation <- data.frame(student=c('Martijn', 'Nick', 'Nicole'), 
                           text=c('This restaurant is never bad', 'The lunch was very good', 'It was awful I got food poisoning and was extremely ill'),
                           stringsAsFactors=TRUE
                           )
# Examine conversation
conversation
##   student                                                    text
## 1 Martijn                            This restaurant is never bad
## 2    Nick                                 The lunch was very good
## 3  Nicole It was awful I got food poisoning and was extremely ill
# Polarity - All
polarity(conversation$text)
##   all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all               3          21        0.317       0.565              0.561
# Polarity - Grouped
student_pol <- conversation %$%
  polarity(text, student)

# Student results
scores(student_pol)
##   student total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 Martijn               1           5        0.447          NA                 NA
## 2    Nick               1           5        0.805          NA                 NA
## 3  Nicole               1          11       -0.302          NA                 NA
# Sentence by sentence
counts(student_pol)
##   student wc polarity pos.words neg.words                                                text.var
## 1 Martijn  5    0.447         -       bad                            This restaurant is never bad
## 2    Nick  5    0.805      good         -                                 The lunch was very good
## 3  Nicole 11   -0.302         -     awful It was awful I got food poisoning and was extremely ill
# qdap plot
plot(student_pol)
## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.

## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.
## Warning: Removed 3 rows containing missing values (geom_errorbarh).

# Even with Zipf's law in action, you will still need to adjust lexicons to fit the text source (for example twitter versus legal documents) or the author's demographics (teenage girl versus middle aged man)
# This exercise demonstrates the explicit components of polarity() so you can change it if needed

# In Trey Songz "Lol :)" song there is a lyric "LOL smiley face, LOL smiley face."
# In the basic polarity() function, "LOL" is not defined as positive
# However, "LOL" stands for "Laugh Out Loud" and should be positive
# As a result, you should adjust the lexicon to fit the text's context which includes pop-culture slang
# If your analysis contains text from a specific channel (Twitter's "LOL"), location (Boston's "Wicked Good"), or age group (teenagers "sick") you will likely have to adjust the lexicon

# In this exercise you are not adjusting the subjectivity lexicon or qdap dictionaries containing valence shifters
# Instead you are examining the existing word data frame objects so you can change them in the following exercise

# We've created text containing two excerpts from Beyoncé's "Crazy in Love" lyrics for the exercise

# Examine the key.pol
key.pol
##                x  y
##    1:     a plus  1
##    2:   abnormal -1
##    3:    abolish -1
##    4: abominable -1
##    5: abominably -1
##   ---              
## 6775:  zealously -1
## 6776:     zenith  1
## 6777:       zest  1
## 6778:      zippy  1
## 6779:     zombie -1
# Negators
negation.words
##  [1] "ain't"     "aren't"    "can't"     "couldn't"  "didn't"   
##  [6] "doesn't"   "don't"     "hasn't"    "isn't"     "mightn't" 
## [11] "mustn't"   "neither"   "never"     "no"        "nobody"   
## [16] "nor"       "not"       "shan't"    "shouldn't" "wasn't"   
## [21] "weren't"   "won't"     "wouldn't"
# Amplifiers
amplification.words
##  [1] "acute"         "acutely"       "certain"       "certainly"    
##  [5] "colossal"      "colossally"    "deep"          "deeply"       
##  [9] "definite"      "definitely"    "enormous"      "enormously"   
## [13] "extreme"       "extremely"     "great"         "greatly"      
## [17] "heavily"       "heavy"         "high"          "highly"       
## [21] "huge"          "hugely"        "immense"       "immensely"    
## [25] "incalculable"  "incalculably"  "massive"       "massively"    
## [29] "more"          "particular"    "particularly"  "purpose"      
## [33] "purposely"     "quite"         "real"          "really"       
## [37] "serious"       "seriously"     "severe"        "severely"     
## [41] "significant"   "significantly" "sure"          "surely"       
## [45] "true"          "truly"         "vast"          "vastly"       
## [49] "very"
# De-amplifiers
deamplification.words
##  [1] "barely"       "faintly"      "few"          "hardly"      
##  [5] "little"       "only"         "rarely"       "seldom"      
##  [9] "slightly"     "sparesly"     "sporadically" "very few"    
## [13] "very little"
text <- data.frame(speaker=c("beyonce", "jay_z"),
                   words=c("I know I dont understand Just how your love can do what no one else can", "They cant figure him out they like hey, is he insane"),
                   stringsAsFactors=TRUE
                   )

# Examine
text
##   speaker
## 1 beyonce
## 2   jay_z
##                                                                     words
## 1 I know I dont understand Just how your love can do what no one else can
## 2                    They cant figure him out they like hey, is he insane
# Explicit polarity parameters
polarity(
  text.var       = text$words,
  grouping.var   = text$speaker,
  polarity.frame = key.pol,
  negators       = negation.words,
  amplifiers     = amplification.words,
  deamplifiers   = deamplification.words 
)
##   speaker total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 beyonce               1          16         0.25          NA                 NA
## 2   jay_z               1          11         0.00          NA                 NA
# Here you will adjust the negative words to account for the specific text. You will then compare the basic and custom polarity() scores

# A popular song from Twenty One Pilots is called "Stressed Out".
# If you scan the lyrics of this song, you will observe the song is about youthful nostalgia
# Overall, most people would say the polarity is negative
# Repeatedly the lyrics mention stress, fears and pretending

# Let's compare the song lyrics using the default subjectivity lexicon and also a custom one

# To start, you need to verify the key.pol subjectivity lexicon does not already have the term you want to add
# One way to check is with grep()
# The grep() function returns the row containing characters that match a search pattern
# data_frame[grep("search_pattern", data_frame$column), ]

# After verifying the slang or new word is not already in the key.pol lexicon you need to add it
# The code below uses sentiment_frame() to construct the new lexicon
# Within the code sentiment_frame() accepts the original positive word vector, positive.words
# Next, the original negative.words are concatenated to "smh" and "kappa", both considered negative slang
# Although you can declare the positive and negative weights, the default is 1 and -1 so they are not included below
# custom_pol <- sentiment_frame(positive.words, c(negative.words, "hate", "pain"))

stressed_out <- "I wish I found some better sounds no ones ever heard\nI wish I had a better voice that sang some better words\nI wish I found some chords in an order that is new\nI wish I didnt have to rhyme every time I sang\nI was told when I get older all my fears would shrink\nBut now Im insecure and I care what people think\nMy names Blurryface and I care what you think\nMy names Blurryface and I care what you think\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWere stressed out\nSometimes a certain smell will take me back to when I was young\nHow come Im never able to identify where its coming from\nId make a candle out of it if I ever found it\nTry to sell it never sell out of it Id probably only sell one\nItd be to my brother, cause we have the same nose\nSame clothes homegrown a stones throw from a creek we used to roam\nBut it would remind us of when nothing really mattered\nOut of student loans and tree-house homes we all would take the latter\nMy names Blurryface and I care what you think\nMy names Blurryface and I care what you think\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWe used to play pretend, give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face #\nSaying, Wake up you need to make money\nYeah\nWe used to play pretend give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face\nSaying, Wake up, you need to make money\nYeah\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nUsed to play pretend, used to play pretend bunny\nWe used to play pretend wake up, you need the money\nUsed to play pretend used to play pretend bunny\nWe used to play pretend, wake up, you need the money\nWe used to play pretend give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face\nSaying, Wake up, you need to make money\nYeah"

# stressed_out has been pre-defined
head(stressed_out)
## [1] "I wish I found some better sounds no ones ever heard\nI wish I had a better voice that sang some better words\nI wish I found some chords in an order that is new\nI wish I didnt have to rhyme every time I sang\nI was told when I get older all my fears would shrink\nBut now Im insecure and I care what people think\nMy names Blurryface and I care what you think\nMy names Blurryface and I care what you think\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWere stressed out\nSometimes a certain smell will take me back to when I was young\nHow come Im never able to identify where its coming from\nId make a candle out of it if I ever found it\nTry to sell it never sell out of it Id probably only sell one\nItd be to my brother, cause we have the same nose\nSame clothes homegrown a stones throw from a creek we used to roam\nBut it would remind us of when nothing really mattered\nOut of student loans and tree-house homes we all would take the latter\nMy names Blurryface and I care what you think\nMy names Blurryface and I care what you think\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWe used to play pretend, give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face #\nSaying, Wake up you need to make money\nYeah\nWe used to play pretend give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face\nSaying, Wake up, you need to make money\nYeah\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nUsed to play pretend, used to play pretend bunny\nWe used to play pretend wake up, you need the money\nUsed to play pretend used to play pretend bunny\nWe used to play pretend, wake up, you need the money\nWe used to play pretend give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face\nSaying, Wake up, you need to make money\nYeah"
# Basic lexicon score
polarity(stressed_out)
##   all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all               1         518       -0.255          NA                 NA
# Check the subjectivity lexicon
key.pol[grep("stress", x)]
##                x  y
## 1:      distress -1
## 2:    distressed -1
## 3:   distressing -1
## 4: distressingly -1
## 5:      mistress -1
## 6:        stress -1
## 7:      stresses -1
## 8:     stressful -1
## 9:   stressfully -1
# New lexicon
custom_pol <- sentiment_frame(positive.words, c(negative.words, "stressed", "turn back"))

# Compare new score
polarity(stressed_out, polarity.frame = custom_pol)
##   all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all               1         518       -0.826          NA                 NA

Chapter 2 - Sentiment Analysis the tidytext Way

Plutchik’s wheel of emotion, polarity vs. sentiment:

  • Sentiment is much more complex than positive or negative
  • One popular approach is to use Plutchik’s wheel, which captures eight core human emotions
    • The basic theory is that these eight emotions are evolutionary, with all other emotions being derivative of them
    • The wheel generally has similar emotions next to each other and different emotions across from each other
    • Many emotions are a mix of two items on the wheel - remorse being a mix of sadness and disgust, for example
    • Further, the emotions tend to be more intense as you move towards the center of the wheel
  • Kanjoya developed a more complex emotional framework for comparison, with much more diverse connections
    • Created by tracking user stories and interactions on experienceproject.com

Bing lexicon with inner join:

  • The join functions are generic and will accept any two tables (need not be a relational SQL-like process)
    • The “by” parameter should be specified - can be by.x and by.y if the variable names disagree
    • For the joins to work, the data will need to be tables rather than vectors - one word per row per data source
  • Can use many types of joins for analysis
    • The inner_join will find the words that are in both (for example) a text and a lexicon
    • The anti_join keeps only the words that do not appear in the table presented as the second argument
    • The anti_join is broadly the equivalent of the stop words approach in the tidyverse

AFINN and NRC methodologies in more detail:

  • AFINN contains words labeled by researchers, and given scores of -5 to +5 (there are no neutral or 0-value words in this list)
  • NRC was produced through crowd-sourcing, and contains positive, negative, and the Plutchik’s wheel
    • The book has been tidied such that every word is its own row, including data about which line it was found on
  • Can use dplyr::summarize() to get the scores by line
    • Can also use dplyr::filter() to get key results by a subset of interest

Example code includes:

# There is a growing number of "tidyverse" R packages
# The tidyverse is a collection of R packages that share common philosophies and are designed to work together
# This chapter covers some tidy functions to manipulate data
# In fact, in this exercise you will compare a DTM to a tidy text data frame called a tibble

# Within the tidyverse each observation is a single row in a data frame
# That makes working in different packages much easier since the fundamental data structure is the same
# Parts of this course borrow heavily from the tidytext package which uses this data organization

# To change a DTM to a tidy format use tidy() from the broom package.
# tidy_format <- tidy(Document_Term_Matrix)

# This exercise uses text from the Greek tragedy, Agamemnon
# Agamemnon is a story about marital infidelity and murder
# You can download a copy here (http://www.gutenberg.org/ebooks/14417?msg=welcome_stranger)

# We've already created a clean DTM called ag_dtm for this exercise.

clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, c(stopwords("en")))
  return(corpus)
}

agRawText <- readLines("./RInputFiles/pg14417.txt")
agSource <- VectorSource(agRawText[318:3430])
agCorpus <- VCorpus(agSource)
agClean <- clean_corpus(agCorpus)
ag_dtm <- DocumentTermMatrix(agClean)


# As matrix
ag_dtm_m <- as.matrix(ag_dtm)

# Examine line 2206 and columns 245:250 (edited to 2206 and 308:313)
ag_dtm_m[2206, 308:313]
##    bleed   bleeds    blent    bless blessã¨d blessing 
##        0        0        0        1        0        0
# Tidy up the DTM (function does not work here . . . )
# ag_tidy <- broom::tidy(ag_dtm)
ag_tidy <- tibble::tibble(document=ag_dtm$dimnames$Docs[ag_dtm$i], 
                          term=ag_dtm$dimnames$Terms[ag_dtm$j], 
                          count=ag_dtm$v
                          )

# Examine tidy with a word you saw
ag_tidy[824:828, ]
## # A tibble: 5 x 3
##   document term     count
##   <chr>    <chr>    <dbl>
## 1 234      bleeds    1.00
## 2 234      sleepeth  1.00
## 3 235      comes     1.00
## 4 235      will      1.00
## 5 235      wisdom    1.00
# So far you have used a single lexicon
# Now we will transition to using three, each measuring sentiment in different ways

# The tidytext package contains a data frame called sentiments
# The data frame contains over 23000 terms from three different subjectivity lexicons with corresponding information
# Here are some example rows from the sentiments data frame

# Notice the tidy format
# Each word is a row and NAs fill in columns that are not applicable
# The "AFINN" lexicon scores words from 5 to -5
# The "Bing" lexicon is the same lexicon used in qdap's polarity() function
# "Bing" words are only labeled as positive or negative
# The "NRC" lexicon has distinct emotional classes covering Plutchik's Wheel and positive and negative

# Subset to AFINN
afinn_lex <- tidytext::get_sentiments("afinn")

# Count AFINN scores
afinn_lex %>% 
  count(score)
## # A tibble: 11 x 2
##    score     n
##    <int> <int>
##  1    -5    16
##  2    -4    43
##  3    -3   264
##  4    -2   965
##  5    -1   309
##  6     0     1
##  7     1   208
##  8     2   448
##  9     3   172
## 10     4    45
## 11     5     5
# Subset to nrc
nrc_lex <- tidytext::get_sentiments("nrc")

# Print nrc_lex
nrc_lex
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows
# Make the nrc counts object
nrc_counts <- nrc_lex %>% 
  count(sentiment)
        
# Barplot
ggplot(nrc_counts, aes(x = sentiment, y = n))+
  geom_bar(stat = "identity") +
  ggthemes::theme_gdocs()

# The Bing lexicon labels words as positive or negative
# The next three exercises let you interact with this specific lexicon
# Instead of using filter() to extract a lexicon this exercise uses get_sentiments() which accepts a string such as "afinn", "bing", "nrc", or "loughran"

# Now that you understand the basics of an inner join, let's apply this to the "Bing" lexicon
# Keep in mind the inner_join() function comes from dplyr and the sentiments object is from tidytext

# The inner join workflow:
# Obtain the correct lexicon using either filter() or get_sentiments().
# Pass the lexicon and the tidy text data to inner_join().
# In order for inner_join() to work there must be a shared column name. If there are no shared column names, declare them with an additional parameter, by equal to c with column names like below
# object <- x %>% 
#     inner_join(y, by = c("column_from_x" = "column_from_y")

# We've loaded ag_txt containing the first 100 lines from Agamemnon and ag_tidy which is the tidy version

ag_txt <- agRawText[agRawText != ""][1:100]

# Qdap polarity
polarity(ag_txt)
## Warning in polarity(ag_txt): 
##   Some rows contain double punctuation.  Suggested use of `sentSplit` function.
##   all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all             100        1038       -0.093       0.364             -0.255
# Get Bing lexicon
bing <- tidytext::get_sentiments("bing")

# Join text to lexicon
ag_bing_words <- inner_join(ag_tidy, bing, by = c("term" = "word"))

# Examine
ag_bing_words
## # A tibble: 1,641 x 4
##    document term     count sentiment
##    <chr>    <chr>    <dbl> <chr>    
##  1 10       waste     1.00 negative 
##  2 11       respite   1.00 positive 
##  3 13       well      1.00 positive 
##  4 14       lonely    1.00 negative 
##  5 16       great     1.00 positive 
##  6 16       heavenly  1.00 positive 
##  7 22       dark      1.00 negative 
##  8 23       fear      1.00 negative 
##  9 24       warning   1.00 negative 
## 10 25       well      1.00 positive 
## # ... with 1,631 more rows
# Get counts by sentiment
ag_bing_words %>%
  count(sentiment)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   1033
## 2 positive    608
# The spread() function spreads a key-value pair across multiple columns
# In this case key is the sentiment and the values are the frequency of positive or negative terms for each line
# Using spread() changes the data so that each row now has positive and negative values, even if it is 0

# In this exercise, your R session has m_dick_tidy which contains the book Moby Dick and bing, containing the lexicon similar to the previous exercise

all_books <- readRDS("./RInputFiles/all_books.rds")
m_dick_tidy <- all_books[all_books$book=="moby_dick", c("term", "document", "count")]
m_dick_tidy
## # A tibble: 109,996 x 3
##    term            document count
##    <chr>           <chr>    <dbl>
##  1 chapter         2         1.00
##  2 loomings        2         1.00
##  3 agonever        5         1.00
##  4 call            5         1.00
##  5 ishmael         5         1.00
##  6 long            5         1.00
##  7 mind            5         1.00
##  8 preciselyhaving 5         1.00
##  9 some            5         1.00
## 10 years           5         1.00
## # ... with 109,986 more rows
# Inner join
moby_lex_words <- inner_join(m_dick_tidy, bing, by = c("term" = "word"))

moby_lex_words <- moby_lex_words %>%
  # Set index to numeric document
  mutate(index = as.numeric(document))

moby_count <- moby_lex_words %>%
  # Count by sentiment, index
  count(sentiment, index)

# Examine the counts
moby_count
## # A tibble: 10,594 x 3
##    sentiment index     n
##    <chr>     <dbl> <int>
##  1 negative   9.00     1
##  2 negative  11.0      1
##  3 negative  22.0      1
##  4 negative  41.0      1
##  5 negative  42.0      2
##  6 negative  44.0      1
##  7 negative  56.0      1
##  8 negative  64.0      1
##  9 negative  66.0      1
## 10 negative  68.0      1
## # ... with 10,584 more rows
moby_spread <- moby_count %>%
  # Spread sentiments
  tidyr::spread(sentiment, n, fill = 0)

# Review the spread data
moby_spread
## # A tibble: 9,229 x 3
##    index negative positive
##    <dbl>    <dbl>    <dbl>
##  1  9.00     1.00     0   
##  2 11.0      1.00     0   
##  3 13.0      0        1.00
##  4 17.0      0        1.00
##  5 19.0      0        1.00
##  6 22.0      1.00     0   
##  7 24.0      0        1.00
##  8 25.0      0        1.00
##  9 31.0      0        2.00
## 10 35.0      0        2.00
## # ... with 9,219 more rows
# The last Bing lexicon exercise!
# We started with this lexicon since its similar to the results in Chapter 1
# In this exercise you will use the pipe operator (%>%) to create a timeline of the sentiment in Moby Dick
# In the end you will also create a simple visual following the code structure below
# The next chapter goes into more depth for visuals

# Your R session has moby as your text and bing as your lexicon
# After this exercise you should know Is Moby Dick a happy or sad book?

moby_polarity <- m_dick_tidy %>%
  mutate(index = as.numeric(document)) %>%
  # Inner join to lexicon
  inner_join(bing, by = c("term" = "word")) %>%
  # Count the sentiment scores
  count(sentiment, index) %>% 
  # Spread the sentiment into positive and negative columns
  tidyr::spread(sentiment, n, fill = 0) %>%
  # Add polarity column
  mutate(polarity = positive - negative)

# Plot polarity vs. index
ggplot(moby_polarity, aes(x=index, y=polarity)) + 
  # Add a smooth trend curve
  geom_smooth()
## `geom_smooth()` using method = 'gam'

# Now we transition to the AFINN lexicon
# The AFINN lexicon has numeric values from 5 to -5, not just positive or negative
# Unlike the Bing lexicon's sentiment, the AFINN lexicon's sentiment score column is called score

# As before, you apply inner_join() then count()
# Next, to sum the scores of each line, we use dplyr's group_by() and summarize() functions
# The group_by() function takes an existing data frame and converts it into a grouped data frame where operations are performed "by group"
# Then, the summarize() function lets you calculate a value for each group in your data frame using a function that aggregates data, like sum() or mean()
# So, in our case we can do something like
# data_frame %>% 
#     group_by(book_line) %>% 
#     summarize(total_score = sum(book_line))

# In the tidy version of Huckleberry Finn, line 9703 contains words "best", "ever", "fun", "life" and "spirit". "best" and "fun" have AFINN scores of 3 and 4 respectively
# After aggregating, line 9703 will have a total score of 7

# The afinn object contains the AFINN lexicon
# The huck object is a tidy version of Mark Twain's Adventures of Huckleberry Finn for analysis

# Line 5400 is All the loafers looked glad; I reckoned they was used to having fun out of Boggs
# Stopwords and punctuation have already been removed in the dataset

huck <- all_books[all_books$book=="huck_finn", c("term", "document", "count")] %>%
    mutate(document=as.numeric(document)) %>%
    rename(line=document)
huck
## # A tibble: 55,198 x 3
##    term          line count
##    <chr>        <dbl> <dbl>
##  1 finn          1.00  1.00
##  2 ïhuckleberry  1.00  1.00
##  3 ago           3.00  1.00
##  4 fifty         3.00  1.00
##  5 forty         3.00  1.00
##  6 mississippi   3.00  1.00
##  7 scene         3.00  1.00
##  8 the           3.00  1.00
##  9 time          3.00  1.00
## 10 valley        3.00  1.00
## # ... with 55,188 more rows
# See abbreviated line 5400
huck %>% filter(line == 5400)
## # A tibble: 7 x 3
##   term      line count
##   <chr>    <dbl> <dbl>
## 1 all       5400  1.00
## 2 fun       5400  1.00
## 3 glad      5400  1.00
## 4 loafers   5400  1.00
## 5 looked    5400  1.00
## 6 reckoned  5400  1.00
## 7 used      5400  1.00
# What are the scores of the sentiment words?
afinn_lex %>% filter(word %in% c("fun", "glad"))
## # A tibble: 2 x 2
##   word  score
##   <chr> <int>
## 1 fun       4
## 2 glad      3
huck_afinn <- huck %>% 
  # Inner Join to AFINN lexicon
  inner_join(afinn_lex, by = c("term" = "word")) %>%
  # Count by score and line
  count(score, line)

huck_afinn_agg <- huck_afinn %>% 
  # Group by line
  group_by(line) %>%
  # Sum scores by line
  summarize(total_score = sum(score))

# Filter huck_afinn_agg
huck_afinn_agg %>% filter(line == 5400)
## # A tibble: 1 x 2
##    line total_score
##   <dbl>       <int>
## 1  5400           7
# Plot total score vs. line
ggplot(huck_afinn_agg, aes(x=line, y=total_score)) + 
  # Add a smooth trend curve
  geom_smooth()
## `geom_smooth()` using method = 'gam'

# Last but not least, you get to work with the NRC lexicon which labels words across multiple emotional states
# Remember Plutchik's wheel of emotion? The NRC lexicon tags words according to Plutchik's 8 emotions plus positive/negative

# In this exercise there is a new operator, %in%, which matches a vector to another
# In the code below %in% will return FALSE, FALSE, TRUE
# This is because within some_vec, 1 and 2 are not found within some_other_vector but 3 is found and returns TRUE
# The %in% is useful to find matches

# We've created oz which is the tidy version of The Wizard of Oz along with nrc containing the "NRC" lexicon with renamed columns
# Switched to Julius Caesar since it is what is easily available in the dataset

jc <- all_books[all_books$book=="julius_caesar", c("term", "document", "count")] %>%
    mutate(document=as.numeric(document)) %>%
    rename(line=document)
jc
## # A tibble: 13,165 x 3
##    term         line count
##    <chr>       <dbl> <dbl>
##  1 etext        1.00  1.00
##  2 file         1.00  1.00
##  3 gutenberg    1.00  1.00
##  4 ïthis        1.00  1.00
##  5 presented    1.00  1.00
##  6 project      1.00  1.00
##  7 cooperation  2.00  1.00
##  8 inc          2.00  1.00
##  9 library      2.00  2.00
## 10 world        2.00  1.00
## # ... with 13,155 more rows
# Join text and lexicon
jc_nrc <- inner_join(jc, nrc_lex, by = c("term" = "word"))

# DataFrame of tally
jc_plutchik <- jc_nrc %>% 
  # Only consider Plutchik sentiments
  filter(!sentiment %in% c("positive", "negative")) %>%
  # Group by sentiment
  group_by(sentiment) %>% 
  # Get total count by sentiment
  summarize(total_count = sum(count))

# Plot the counts
ggplot(jc_plutchik, aes(x = sentiment, y = total_count)) +
  # Add a column geom
  geom_col()


Chapter 3 - Visualizing Sentiment

Parlor trick or worthwhile?

  • Good visualizations can aid in helping decision-makers
    • In general, avoid word clouds which are extremely cliché
    • Bar charts can be just as informative for showing frequency counts
    • Be careful not to be redundant - no need for sentiment analysis on text where the customer already got to answer how satisfied they are, for example
  • Exercises will track sentiment over time - for example, Twitter mentions vs. new advertising
    • Example in this case will be scoring a book over line, to see whether it has a happy or a sad ending

Introduction using sentiment analysis:

  • Reminder that comparison clouds show words in one text but not in the other
    • Can use polarity as a pre-filter - divide the corpus in to positive and negative components
  • Exercise will inner join NRC and Moby Dick
    • Reminder that “or” is | in grepl() or grep()
    • Can create stacked bar charts of emotions, to better compare sentiment make-up by document

Interpreting visualizations:

  • Kernel density plot - good for understanding densities
    • Like a histogram, but with smoothed data - less likely to misinterpret the results due to poor choice of binning
  • Box plot - compare multiple sentiments simultaneously
    • Can be easier to compare multiple distributions - median, IQR, extra 1.5 * IQR length, outliers
  • Radar chart - similar to Plutchik’s wheel
    • Radar charts are also known as spider charts - idea of having multiple bars move in different directions, making for easier comparisons
  • Tree map
    • The size of each square represents a property such as size or volume
    • A second dimension (color) represents polarity or sentiment
    • Similar boxes (either by time or author or sentiment or etc.) can be placed close together

Example code includes:

# Sometimes you want to track sentiment over time
# For example, during an ad campaign you could track brand sentiment to see the campaign's effect
# You saw a few examples of this at the end of the last chapter

# In this exercise you'll recap the workflow for exploring sentiment over time using the novel Moby Dick
# One should expect that happy moments in the book would have more positive words than negative
# Conversely dark moments and sad endings should use more negative language
# You'll also see some tricks to make your sentiment time series more visually appealling

moby_polarity <- m_dick_tidy %>%
  mutate(index = as.numeric(document)) %>%
  # Inner join to the lexicon
  inner_join(bing, by=c("term" = "word")) %>%
  # Count by sentiment, index
  count(sentiment, index) %>%
  # Spread sentiments
  tidyr::spread(sentiment, n, fill=0) %>%
  mutate(
    # Add polarity field
    polarity = positive - negative,
    # Add line number field
    line_number = row_number()
  )

# Plot
ggplot(moby_polarity, aes(x=line_number, y=polarity)) + 
  geom_smooth() +
  geom_hline(yintercept = 0, color = "red") +
  ggtitle("Moby Dick Chronological Polarity") +
  ggthemes::theme_gdocs()
## `geom_smooth()` using method = 'gam'

# One of the easiest ways to explore data is with a frequency analysis
# Although not difficult, in sentiment analysis this simple method can be surprisingly illuminating
# Specifically, you will build a barplot. In this exercise you are once again working with moby and bing to construct your visual

# Inner join without renamed columns
moby_sents <- inner_join(m_dick_tidy, bing, by = c("term" = "word"))

# Tidy sentiment calculation
moby_tidy_sentiment <- moby_sents %>% 
  count(term, sentiment, wt = count) %>%
  tidyr::spread(sentiment, n, fill = 0) %>%
  mutate(polarity = positive - negative)

# Review
moby_tidy_sentiment
## # A tibble: 2,362 x 4
##    term        negative positive polarity
##    <chr>          <dbl>    <dbl>    <dbl>
##  1 abominable      3.00     0       -3.00
##  2 abominate       1.00     0       -1.00
##  3 abomination     1.00     0       -1.00
##  4 abound          0        3.00     3.00
##  5 abruptly        2.00     0       -2.00
##  6 absence         5.00     0       -5.00
##  7 absurd          3.00     0       -3.00
##  8 absurdly        1.00     0       -1.00
##  9 abundance       0        3.00     3.00
## 10 abundant        0        2.00     2.00
## # ... with 2,352 more rows
# Subset
moby_tidy_small <- moby_tidy_sentiment %>% 
  filter(abs(polarity) >= 50)

# Add polarity
moby_tidy_pol <- moby_tidy_small %>% 
  mutate(
    pol = ifelse(polarity > 0, "positive", "negative")
  )

# Plot
ggplot(
  moby_tidy_pol, 
  aes(reorder(term, polarity), polarity, fill = pol)
) +
  geom_bar(stat = "identity") + 
  ggtitle("Moby Dick: Sentiment Word Frequency") + 
  ggthemes::theme_gdocs() +
  theme(axis.text.x = element_text(angle = 90, vjust = -0.1))

# Now that you have seen how polarity can be used to divide a corpus, let's do it!
# This code will walk you through dividing a corpus based on sentiment so you can peer into the informaton in subsets instead of holistically

# Your R session has oz_pol which was created by applying polarity() to "The Wonderful Wizard of Oz."

# For simplicity's sake, we created a simple custom function called pol_subsections() which will divide the corpus by polarity score
# First, the function accepts a data frame with each row being a sentence or document of the corpus
# The data frame is subset anywhere the polarity values are greater than or less than 0
# Finally, the positive and negative sentences, non-zero polarities, are pasted with parameter collapse so that the terms are grouped into a single corpus
# Lastly, the two documents are concatenated into a single vector of two distinct documents

pol_subsections <- function(df) {
  x.pos <- subset(df$text, df$polarity > 0)
  x.neg <- subset(df$text, df$polarity < 0)
  x.pos <- paste(x.pos, collapse = " ")
  x.neg <- paste(x.neg, collapse = " ")
  all.terms <- c(x.pos, x.neg)
  return(all.terms)
}

# At this point you have omitted the neutral sentences and want to focus on organizing the remaining text
# In this exercise we use the %>% operator again to forward objects to functions
# After some simple cleaning use comparison.cloud() to make the visual

# Using Agamemnon instead since easily available
ag_pol <- polarity(agRawText[318:3430])
## Warning in polarity(agRawText[318:3430]): 
##   Some rows contain double punctuation.  Suggested use of `sentSplit` function.
# Add scores to each document line in a data frame
ag_df <- ag_pol$all %>%
  select(text = text.var, polarity = polarity)

# Custom function
all_terms <- pol_subsections(ag_df)

# Make a corpus
all_corpus <- all_terms %>%
  VectorSource() %>% 
  VCorpus()

# Basic TDM
all_tdm <- TermDocumentMatrix(
  all_corpus,
  control = list(
    removePunctuation = TRUE,
    stopwords = stopwords(kind = "en")
  )
) %>%
  as.matrix() %>%
  set_colnames(c("positive", "negative"))

# Make a comparison cloud
wordcloud::comparison.cloud(
  all_tdm,
  max.words = 50,
  colors = c("darkgreen", "darkred")
)

# In this exercise you go beyond subsetting on positive and negative language
# Instead you will subset text by each of the 8 emotions in Plutchik's emotional wheel to construct a visual
# With this approach you will get more clarity in word usage by mapping to a specific emotion instead of just positive or negative.

# Using the tidytext subjectivity lexicon, "nrc", you perform an inner_join() with your text
# The "nrc" lexicon has the 8 emotions plus positive and negative term classes
# So you will have to drop positive and negative words after performing your inner_join()
# One way to do so is with the negation, !, and grepl()

# The "Global Regular Expression Print Logical," grepl(), function will return a True or False if a string pattern is identified in each row
# In this exercise you will search for positive OR negative using the | operator, representing "or" as shown below
# Often this straight line is above the enter key on a keyboard
# Since the ! negation precedes grepl(), the T or F is switched so the "positive|negative" is dropped instead of kept

# Next you apply count() on the identified words along with spread() to get the data frame organized

# This exercise introduces rownames()
# This function declares the names of rows in a data frame
# It behaves a bit differently because rownames() is passed the object gaining the row names on the left side of <-
# On the right side the character vector of names is declared such as data_frame[, 1]. For instance:
# rownames(data_frame) <- vector_of_names

# After setting row names you will create a more varied comparison.cloud()

# NOTE - appears NRC is already converted to have 'term' rather than 'word'

# Inner join
moby_sentiment <- inner_join(m_dick_tidy, nrc_lex, by = c("term" = "word"))

# Drop positive or negative
moby_pos_neg <- moby_sentiment %>%
  filter(!grepl("positive|negative", sentiment))

# Count terms by sentiment then spread 
moby_tidy <- moby_pos_neg %>% 
  count(sentiment, term = term) %>% 
  tidyr::spread(sentiment, n, fill = 0) %>%
  as.data.frame()
  
# Set row names
rownames(moby_tidy) <- moby_tidy[, 1]

# Drop terms column
moby_tidy[, 1] <- NULL

# Examine
head(moby_tidy)
##             anger anticipation disgust fear joy sadness surprise trust
## abandon         0            0       0    3   0       3        0     0
## abandoned       7            0       0    7   0       7        0     0
## abandonment     2            0       0    2   0       2        2     0
## abhorrent       1            0       1    1   0       0        0     0
## abominable      0            0       3    3   0       0        0     0
## abomination     1            0       1    1   0       0        0     0
# Comparison cloud
wordcloud::comparison.cloud(moby_tidy, max.words = 50, title.size = 1.5)

# Another way to slice your text is to understand how much of the document(s) are made of positive or negative words
# For example a restaurant review may have some positive aspects such as "the food was good" but then continue to add "the restaurant was dirty, the staff was rude and parking was awful."
# As a result, you may want to understand how much of a document is dedicated to positive vs negative language
# In this example it would have a higher negative percentage compared to positive

# One method for doing so is to count() the positive and negative words then divide by the number of subjectivity words identified
# In the restaurant review example, "good" would count as 1 positive and "dirty," "rude," and "awful" count as 3 negative terms
# A simple calculation would lead you to believe the restaurant review is 25% positive and 75% negative since there were 4 subjectivity terms

# Start by performing the inner_join() on a unified tidy data frame containing 4 books, Agamemnon, Oz, Huck Finn, and Moby Dick
# Just like the previous exercise you will use filter() and grepl()

# To perform the count() you have to group the data by book and then sentiment
# For example all the positive words for Agamemnon have to be grouped then tallied so that positive words from all books are not mixed
# Luckily, you can pass multiple variables into count() directly

# Forward book_sents, which is the NRC inner join to all tidy books, to filter()

# Review tail of all_books
tail(all_books)
## # A tibble: 6 x 5
##   term       document count author book            
##   <chr>      <chr>    <dbl> <chr>  <chr>           
## 1 ebooks     19117     1.00 twain  innocents_abroad
## 2 email      19117     1.00 twain  innocents_abroad
## 3 hear       19117     1.00 twain  innocents_abroad
## 4 new        19117     1.00 twain  innocents_abroad
## 5 newsletter 19117     1.00 twain  innocents_abroad
## 6 subscribe  19117     1.00 twain  innocents_abroad
# Inner join
books_sents <- inner_join(all_books, nrc_lex, by=c("term"="word"))

# Keep only positive or negative
books_pos_neg <- books_sents %>%
  filter(grepl("positive|negative", sentiment))

# Review tail again
tail(books_pos_neg)
## # A tibble: 6 x 6
##   term        document count author book             sentiment
##   <chr>       <chr>    <dbl> <chr>  <chr>            <chr>    
## 1 included    19106     1.00 twain  innocents_abroad positive 
## 2 compliance  19107     1.00 twain  innocents_abroad positive 
## 3 main        19110     1.00 twain  innocents_abroad positive 
## 4 information 19114     1.00 twain  innocents_abroad positive 
## 5 including   19115     1.00 twain  innocents_abroad positive 
## 6 foundation  19116     1.00 twain  innocents_abroad positive
# Count by book & sentiment
books_sent_count <- books_pos_neg %>%
  count(book, sentiment)

# Review entire object
books_sent_count
## # A tibble: 22 x 3
##    book           sentiment     n
##    <chr>          <chr>     <int>
##  1 bartleby       negative    537
##  2 bartleby       positive    864
##  3 confidence_man negative   3561
##  4 confidence_man positive   5899
##  5 ct_yankee      negative   4048
##  6 ct_yankee      positive   6154
##  7 hamlet         negative   1677
##  8 hamlet         positive   2250
##  9 huck_finn      negative   2471
## 10 huck_finn      positive   3544
## # ... with 12 more rows
# Split, make proportional
book_pos <- books_sent_count %>%
  group_by(book) %>% 
  mutate(percent_positive = n / sum(n) * 100)

# Proportional bar plot
ggplot(book_pos, aes(x = book, y = percent_positive, fill = sentiment)) +  
  geom_bar(stat = "identity")

# We've loaded ag as a tidy version of Agamemnon and created afinn as a subset of the tidytext "afinn" lexicon
# Agamemnon inner join
ag_afinn <- inner_join(ag_tidy, afinn_lex, by=c("term"="word")) %>%
    mutate(line=as.numeric(document)) %>%
    select(-document)

# Add book
ag_afinn$book <- "agamemnon"

# Oz inner join (use jc instead)
jc_afinn <- inner_join(jc, afinn_lex, by=c("term"="word"))

# Add book
jc_afinn$book <- "jc"

# Combine
all_df <- rbind(ag_afinn, jc_afinn)

# Plot 2 densities
ggplot(all_df, aes(x = score, fill = book)) + 
  geom_density(alpha = 0.3) + 
  ggthemes::theme_gdocs() +
  ggtitle("AFINN Score Densities")

# In this exercise the all_book_polarity object is already loaded
# The data frame contains two columns, book and polarity
# It comprises all books with qdap's polarity() function applied

all_book_polarity <- readRDS("./RInputFiles/all_book_polarity.rds")

# Examine
str(all_book_polarity)
## 'data.frame':    14437 obs. of  2 variables:
##  $ book    : Factor w/ 4 levels "huck","agamemnon",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ polarity: num  0.277 0.258 -0.577 0.25 0.516 ...
# Summary by document
tapply(all_book_polarity$polarity, all_book_polarity$book, FUN=summary)
## $huck
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.38700 -0.25820  0.23570  0.04156  0.26730  1.60400 
## 
## $agamemnon
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.4670 -0.3780 -0.3333 -0.1266  0.3333  1.2250 
## 
## $moby
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.13300 -0.28870 -0.25000 -0.02524  0.28870  1.84800 
## 
## $oz
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.2730 -0.2774  0.2582  0.0454  0.2887  1.1880
# Box plot
ggplot(all_book_polarity, aes(x = book, y = polarity)) +
  geom_boxplot(fill = c("#bada55", "#F00B42", "#F001ED", "#BA6E15"), col = "darkred") +
  geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 0.02) +
  ggthemes::theme_gdocs() +
  ggtitle("Book Polarity")

# Remember Plutchik's wheel of emotion?
# The NRC lexicon has the 8 emotions corresponding to the first ring of the wheel
# Previously you created a comparison.cloud() according to the 8 primary emotions
# Now you will create a radar chart similar to the wheel in this exercise

# A radarchart is a two-dimensional representation of multidimensional data (at least 3)
# In this case the tally of the different emotions for a book are represented in the chart
# Using a radar chart, you can review all 8 emotions simultaneously

# As before we've loaded the "nrc" lexicon as nrc and moby_huck which is a combined tidy version of both Moby Dick and Huck Finn

bindMoby <- m_dick_tidy %>%
    mutate(document=as.numeric(document), book="moby")
bindHuck <- huck %>%
    mutate(book="huck") %>%
    rename(document=line)
moby_huck <- rbind(bindMoby, bindHuck)
moby_huck
## # A tibble: 165,194 x 4
##    term            document count book 
##  * <chr>              <dbl> <dbl> <chr>
##  1 chapter             2.00  1.00 moby 
##  2 loomings            2.00  1.00 moby 
##  3 agonever            5.00  1.00 moby 
##  4 call                5.00  1.00 moby 
##  5 ishmael             5.00  1.00 moby 
##  6 long                5.00  1.00 moby 
##  7 mind                5.00  1.00 moby 
##  8 preciselyhaving     5.00  1.00 moby 
##  9 some                5.00  1.00 moby 
## 10 years               5.00  1.00 moby 
## # ... with 165,184 more rows
# Review tail of moby_huck
tail(moby_huck)
## # A tibble: 6 x 4
##   term       document count book 
##   <chr>         <dbl> <dbl> <chr>
## 1 subscribe     11788  1.00 huck 
## 2 ebooks        11789  1.00 huck 
## 3 email         11789  1.00 huck 
## 4 hear          11789  1.00 huck 
## 5 new           11789  1.00 huck 
## 6 newsletter    11789  1.00 huck
# Inner join
books_sents <- inner_join(moby_huck, nrc_lex, by=c("term"="word"))

# Drop positive or negative
books_pos_neg <- books_sents %>%
  filter(!grepl("positive|negative", sentiment))

# Tidy tally
books_tally <- books_pos_neg %>%
  group_by(book, sentiment) %>%
  tally()

# Key value pairs
scores <- books_tally %>%
  tidyr::spread(book, n) 
  
# Review scores
scores
## # A tibble: 8 x 3
##   sentiment     huck  moby
##   <chr>        <int> <int>
## 1 anger         1123  2811
## 2 anticipation  2214  4740
## 3 disgust        823  2025
## 4 fear          1332  4178
## 5 joy           1713  3175
## 6 sadness       1303  3393
## 7 surprise      1154  2153
## 8 trust         2191  5099
# JavaScript radar chart
radarchart::chartJSRadar(scores)
# Make the scores relatove to total
scoresRelative <- scores %>%
    mutate(huckRel = huck/sum(huck), mobyRel=moby/sum(moby))
scoresRelative
## # A tibble: 8 x 5
##   sentiment     huck  moby huckRel mobyRel
##   <chr>        <int> <int>   <dbl>   <dbl>
## 1 anger         1123  2811  0.0947  0.102 
## 2 anticipation  2214  4740  0.187   0.172 
## 3 disgust        823  2025  0.0694  0.0734
## 4 fear          1332  4178  0.112   0.152 
## 5 joy           1713  3175  0.145   0.115 
## 6 sadness       1303  3393  0.110   0.123 
## 7 surprise      1154  2153  0.0974  0.0781
## 8 trust         2191  5099  0.185   0.185
# JavaScript radar chart
radarchart::chartJSRadar(scoresRelative[, c("sentiment", "huckRel", "mobyRel")])
# Often you will find yourself working with documents in groups, such as author, product or by company
# This exercise lets you learn about the text while retaining the groups in a compact visual
# For example, with customer reviews grouped by product you may want to explore multiple dimensions of the customer reviews at the same time
# First you could calculate the polarity() of the reviews. Another dimension may be length
# Document length can demonstrate the emotional intensity
# If a customer leaves a short "great shoes!" one could infer they are actually less enthusiastic compared to a lengthier positive review
# You may also want to group reviews by product type such as women's, men's and children's shoes. A treemap lets you examine all of these dimensions

# For text analysis, within a treemap each individual box represents a document such as a tweet
# Documents are grouped in some manner such as author
# The size of each box is determined by a numeric value such as number of words or letters
# The individual colors are determined by a sentiment score

# After you organize the tibble, you use the treemap library containing the function treemap() to make the visual
# The code example below declares the data, grouping variables, size, color and other aesthetics
# treemap(data_frame,
#         index = c("group", "individual_document"),
#         vSize = "V1",
#         vColor = "avg_score",
#         type = "value",
#         title = "Book Sentiment Scores",
#         palette = c("red", "white", "green"))

# The pre-loaded all_books object contains a combined tidy format corpus with 4 Shakespeare, 3 Melville and 4 Twain books
# Based on the treemap you should be able to tell who writes longer books, and the polarity of the author as a whole and for individual books

books_score <- all_books %>% 
  # Inner join with AFINN scores
  inner_join(afinn_lex, by=c("term" = "word"))

book_length <- books_score %>% 
  # Count number of words per book
  count(book)

book_score <- books_score %>% 
  # Group by author, book
  group_by(author, book) %>%
  # Calculate mean book score
  summarize(mean_score = mean(score))

book_tree <- book_score %>% 
  # Inner join by book
  inner_join(book_length, by=c("book"))

# Examine the results
book_tree
## # A tibble: 11 x 4
## # Groups:   author [?]
##    author      book             mean_score     n
##    <chr>       <chr>                 <dbl> <int>
##  1 melville    bartleby             0.101    761
##  2 melville    confidence_man       0.506   5480
##  3 melville    moby_dick            0.161   8973
##  4 shakespeare hamlet               0.0984  2064
##  5 shakespeare julius_caesar        0.0846  1359
##  6 shakespeare macbeth              0.222    910
##  7 shakespeare romeo_juliet         0.175   1978
##  8 twain       ct_yankee            0.199   6083
##  9 twain       huck_finn            0.0763  4849
## 10 twain       innocents_abroad     0.405   8988
## 11 twain       tom_sawyer          -0.0265  3741
# Make the visual
treemap::treemap(book_tree,
        index = c("author", "book"),
        vSize = "n",
        vColor = "mean_score",
        type = "value",
        title = "Book Sentiment Scores",
        palette = c("red", "white", "green")
        )


Chapter 4 - Case Study: Airbnb

Refresher on text mining workflow:

  • Text mining is the process of moving from disorganized data to organized conclusions
    1. Define the project and specific goals
    2. Identfy the text to be analyzed
    3. Organize the text
    4. Feature extraction
    5. Analyze data artifacts - visuals, summary statistics, etc.
    6. Draw conclusions, and wrap back to the first step in the pipeline
  • Goal for this exercise is to look at rental properties to see how a specific property measures up
    • Find appropriate data sources
    • Be careful about the terms of service, including potential conflicts about web scraping

Organize and clean the text:

  • Polarity scoring is often a good starting point for EDA
  • Adding an original word order column is often a good idea (not always needed, but sometimes useful later)

Feature extraction and analysis:

  • Rental review polarity tends to skew positive - social pressure leads to grade inflation
  • There is a common trend where people write longer the more polarized their opinion
    • Goal is to demonstrate a relationship between polarity and autohor effort
  • Comparison clouds plot the terms that are not sharde - see positive vs. negative
  • Can also scale polarity scores using scale() - defaults to putting everything on mean 0, sd 1 (z score)

Draw conclusions:

  • Use the terms in the reviews to answer key questions
  • The final step in the workflow is to work through to a conclusion

Next steps:

  • qdap::polarity()
  • tidytext
  • inner_join
  • sentiment lexicons with get_sentiments()
  • comparison clouds

Example code includes:

# The Boston property rental reviews are stored in a CSV file located by the predefined variable bos_reviews_file

# bos_reviews_file has been pre-defined
# bos_reviews_file

# load raw text
# bos_reviews <- read.csv(bos_reviews_file, stringsAsFactors = FALSE)
bos_reviews <- readRDS("./RInputFiles/bos_reviews.rds")

# Structure
str(bos_reviews)
## 'data.frame':    1000 obs. of  2 variables:
##  $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ comments: chr  "My daughter and I had a wonderful stay with Maura. She kept in close touch with us throughout the day as we weren't arriving ti"| __truncated__ "We stay at Elizabeth's place for 3 nights in October 2014.\nThe apartment is really a great place to stay. \nLovely decorated a"| __truncated__ "If you're staying in South Boston, this is a terrific place to camp out. The apartment and bedroom are lovely, Ellie is an exce"| __truncated__ "Derian and Brian were great and prompt with their communications with us. The room was as described; it was a small nice and cl"| __truncated__ ...
# Dimensions
dim(bos_reviews)
## [1] 1000    2
# Using a kernel density plot you should notice the reviews do not center on 0. Often there are two causes for this sentiment "grade inflation."
# First, social norms may lead respondents to be pleasant instead of neutral
# This, of course, is channel specific
# Particularly snarky channels like e-sports or social media posts may skew negative leading to "deflation."
# These channels have different expectations
# A second possible reason could be "feature based sentiment".
# In some reviews an author may write "the bed was comfortable and nice but the kitchen was dirty and gross."
# The sentiment of this type of review encompasses multiple features simultaneously and therefore could make an average score skewed

# In a subsequent exercise you will adjust this "grade inflation" but here explore the reviews without any change

# We've also loaded a larger polarity object for all 1000 comments
# This new object is called bos_pol
# Now apply summary() to the correct list element that returns all polarity scores of bos_pol

# Practice apply polarity to first 6 reviews
practice_pol <- polarity(bos_reviews$comments[1:6])
## Warning in polarity(bos_reviews$comments[1:6]): 
##   Some rows contain double punctuation.  Suggested use of `sentSplit` function.
# Review the object
practice_pol
##   all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all               6         390        0.747       0.398              1.875
# Check out the practice polarity
summary(practice_pol$all$polarity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2500  0.5009  0.6594  0.7466  1.0780  1.2460
# Summary for all reviews
bos_pol <- polarity(bos_reviews$comments)
## Warning in polarity(bos_reviews$comments): 
##   Some rows contain double punctuation.  Suggested use of `sentSplit` function.
summary(bos_pol$all$polarity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -0.9712  0.6047  0.8921  0.9022  1.2060  3.7510       1
# Plot it
ggplot(bos_pol$all, aes(x = polarity, y = ..density..)) +
  ggthemes::theme_gdocs() + 
  geom_histogram(binwidth = 0.25, fill = "#bada55", colour = "grey60") +
  geom_density(size = 0.75)
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing non-finite values (stat_density).

# In this exercise you will perform Step 3 of the text mining workflow
# Although qdap isn't a tidy package you will mutate() a new column based on the returned polarity list representing all polarity (that's a hint BTW) scores
# In chapter 3 we used a custom function pol_subsections which uses only base R declarations
# However, in following the tidy principles this exercise uses filter() then introduces pull()
# The pull() function works like works like [[ to extract a single variable

# Once segregated you collapse all the positive and negative comments into two larger documents representing all words among the positive and negative rental reviews

# Lastly, you will create a Term Frequency Inverse Document Frequency (TFIDF) weighted Term Document Matrix (TDM)
# Since this exercise code starts with a tidy structure, some of the functions borrowed from tm are used along with the %>% operator to keep the style consistent
# If the basics of the tm package aren't familiar check out the Text Mining: Bag of Words course
# Instead of counting the number of times a word is used (frequency), the values in the TDM are penalized for over used terms, which helps reduce non-informative words

# Review
bos_pol$group
##   all total.sentences total.words ave.polarity sd.polarity
## 1 all            1000       70481    0.9021735   0.5015318
##   stan.mean.polarity
## 1           1.798836
# Add polarity column
bos_reviews_with_pol <- bos_reviews %>% 
  mutate(polarity = bos_pol$all$polarity)

# Subset positive comments 
pos_comments <- bos_reviews_with_pol %>% 
  filter(polarity > 0) %>% 
  pull(comments)

# Subset negative comments
neg_comments <- bos_reviews_with_pol %>% 
  filter(polarity < 0) %>% 
  pull(comments)

# Paste and collapse the positive comments
pos_terms <- paste(pos_comments, collapse = " ")

# Paste and collapse the negative comments
neg_terms <- paste(neg_comments, collapse = " ")

# Concatenate the terms
all_terms <- c(pos_terms, neg_terms)

# Pipe a VectorSource Corpus
all_corpus <- all_terms %>% 
  VectorSource() %>% 
  VCorpus()

# Simple TFIDF TDM
all_tdm <- TermDocumentMatrix(
  all_corpus, 
  control = list(
    weighting = weightTfIdf, 
    removePunctuation = TRUE, 
    stopwords = stopwords(kind = "en")
  )
)

# Examine the TDM
all_tdm
## <<TermDocumentMatrix (terms: 4967, documents: 2)>>
## Non-/sparse entries: 4350/5584
## Sparsity           : 56%
## Maximal term length: 93
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
# Previously you learned that applying tidy() on a TermDocumentMatrix() object will convert the TDM to a tibble
# In this exercise you will create the word data directly from the review column called comments

# First you use unnest_tokens() to make the text lowercase and tokenize the reviews into single words

# Sometimes it is useful to capture the original word order within each group of a corpus
# To do so, use mutate(). In mutate() you will use seq_along() to create a sequence of numbers from 1 to the length of the object
# This will capture the word order as it was written

# In the tm package, you would use removeWords() to remove stopwords
# In the tidyverse you first need to load the stop words lexicon and then apply an anti_join() between the tidy text data frame and the stopwords

# Load the premade "SMART" stopwords to your R session with data("stop_words")

# Vector to tibble
tidy_reviews <- bos_reviews %>% 
  tidytext::unnest_tokens(word, comments)

# Group by and mutate
tidy_reviews <- tidy_reviews %>% 
  group_by(id) %>% 
  mutate(original_word_order = seq_along(word))

# Quick review
tidy_reviews
## # A tibble: 70,986 x 3
## # Groups:   id [1,000]
##       id word      original_word_order
##    <int> <chr>                   <int>
##  1     1 my                          1
##  2     1 daughter                    2
##  3     1 and                         3
##  4     1 i                           4
##  5     1 had                         5
##  6     1 a                           6
##  7     1 wonderful                   7
##  8     1 stay                        8
##  9     1 with                        9
## 10     1 maura                      10
## # ... with 70,976 more rows
# Load stopwords
data("stop_words", package="tidytext")

# Perform anti-join
tidy_reviews_without_stopwords <- tidy_reviews %>% 
  anti_join(stop_words)
## Joining, by = "word"
# Here you will learn that differing sentiment methods will cause different results
# Often you will simply need to have results align directionally although the specifics may be different
# In the last exercise you created tidy_reviews which is a data frame of rental reviews without stopwords
# Earlier in the chapter, you calculated and plotted qdap's basic polarity() function
# This showed you the reviews tend to be positive

# Now let's perform a similar analysis the tidytext way!
# Recall from an earlier chapter you will perform an inner_join() followed by count() and then a spread()

# Lastly, you will create a new column using mutate() and passing in positive - negative.

# Get the correct lexicon
bing <- tidytext::get_sentiments("bing")

# Calculate polarity for each review
pos_neg <- tidy_reviews_without_stopwords %>% 
  inner_join(bing, by=c("word")) %>%
  count(sentiment) %>%
  tidyr::spread(sentiment, n, fill = 0) %>% 
  mutate(polarity = positive - negative)

# Check outcome
summary(pos_neg)
##        id            negative          positive         polarity      
##  Min.   :   1.0   Min.   : 0.0000   Min.   : 0.000   Min.   :-11.000  
##  1st Qu.: 253.0   1st Qu.: 0.0000   1st Qu.: 3.000   1st Qu.:  2.000  
##  Median : 498.0   Median : 0.0000   Median : 4.000   Median :  4.000  
##  Mean   : 500.4   Mean   : 0.6128   Mean   : 4.965   Mean   :  4.353  
##  3rd Qu.: 748.0   3rd Qu.: 1.0000   3rd Qu.: 7.000   3rd Qu.:  6.000  
##  Max.   :1000.0   Max.   :14.0000   Max.   :28.000   Max.   : 26.000
# Often authors will use more words when they are more passionate
# For example, a mad airline passenger will leave a longer review the worse (the perceived) service
# Conversely a less impassioned passenger may not feel compelled to spend a lot of time writing a review
# Lengthy reviews may inflate overall sentiment since the reviews will inherently contain more positive or negative language as the review lengthens
# This coding exercise helps to examine effort and sentiment

# In this exercise you will visualize the relationship between effort and sentiment
# Recall your rental review tibble contains an id and that a word is represented in each row
# As a result a simple count() of the id will capture the number of words used in each review
# Then you will join this summary to the positive and negative data
# Ultimately you will create a scatter plot that will visualize author review length and its relationship to polarity

# tidy_reviews and pos_neg from the previous exercises are available in your workspace

# Review tidy_reviews
tidy_reviews_without_stopwords
## # A tibble: 26,247 x 3
## # Groups:   id [?]
##       id word      original_word_order
##    <int> <chr>                   <int>
##  1     1 daughter                    2
##  2     1 wonderful                   7
##  3     1 stay                        8
##  4     1 maura                      10
##  5     1 close                      14
##  6     1 touch                      15
##  7     1 day                        20
##  8     1 arriving                   24
##  9     1 til                        25
## 10     1 evening                    29
## # ... with 26,237 more rows
# Review pos_neg
pos_neg
## # A tibble: 953 x 4
## # Groups:   id [953]
##       id negative positive polarity
##    <int>    <dbl>    <dbl>    <dbl>
##  1     1     0        4.00     4.00
##  2     2     0        3.00     3.00
##  3     3     0        3.00     3.00
##  4     4     0        6.00     6.00
##  5     5     0        2.00     2.00
##  6     6     0        3.00     3.00
##  7     7     0        5.00     5.00
##  8     8     0        2.00     2.00
##  9     9     0        4.00     4.00
## 10    10     1.00    15.0     14.0 
## # ... with 943 more rows
# Create effort
effort <- tidy_reviews_without_stopwords %>%
  count(id)

# Inner join
pos_neg_with_effort <- pos_neg %>%
  inner_join(effort, by=c("id"))

# Review 
pos_neg_with_effort
## # A tibble: 953 x 5
## # Groups:   id [?]
##       id negative positive polarity     n
##    <int>    <dbl>    <dbl>    <dbl> <int>
##  1     1     0        4.00     4.00    26
##  2     2     0        3.00     3.00    27
##  3     3     0        3.00     3.00    16
##  4     4     0        6.00     6.00    32
##  5     5     0        2.00     2.00     8
##  6     6     0        3.00     3.00    21
##  7     7     0        5.00     5.00    18
##  8     8     0        2.00     2.00    10
##  9     9     0        4.00     4.00    12
## 10    10     1.00    15.0     14.0     46
## # ... with 943 more rows
# Add pol
pos_neg_pol <- pos_neg_with_effort %>%
  mutate(
    pol = ifelse(
      polarity >= 0, 
      "Positive", 
      "Negative"
    )
  )

# Plot
ggplot(pos_neg_pol, aes(polarity, n, color = pol)) + 
    geom_point(alpha = 0.25) +
    geom_smooth(method = "lm", se = FALSE) +
    ggthemes::theme_gdocs() + 
    ggtitle("Relationship between word effort & polarity")

# This exercise will create a common visual for you to understand term frequency
# Specifically, you will review the most frequent terms from among the positive and negative collapsed documents
# Recall the TermDocumentMatrix all_tdm you created earlier
# Instead of 1000 rental reviews the matrix contains 2 documents containing all reviews separated by the polarity() score

# It's usually easier to change the TDM to a matrix
# From there you simply rename the columns
# Remember that the colnames() function is called on the left side of the assignment operator as shown below
# colnames(OBJECT) <- c("COLUMN_NAME1", "COLUMN_NAME2")
# Once done, you will reorder the matrix to see the most positive and negative words. Review these terms so you can answer the conclusion exercises!
# Lastly, you'll visualize the terms using comparison.cloud().


# Matrix
all_tdm_m <- as.matrix(all_tdm)

# Column names
colnames(all_tdm_m) <- c("positive", "negative")

# Top pos words
order_by_pos <- order(all_tdm_m[, 1], decreasing = TRUE)

# Review top 10 pos words
all_tdm_m[order_by_pos, ] %>% head(n=10)
##              Docs
## Terms            positive negative
##   walk        0.004557696        0
##   definitely  0.004172956        0
##   staying     0.003729024        0
##   city        0.003285093        0
##   wonderful   0.003107520        0
##   restaurants 0.003048329        0
##   highly      0.002959543        0
##   station     0.002693184        0
##   enjoyed     0.002426825        0
##   subway      0.002397230        0
# Top neg words
order_by_neg <- order(all_tdm_m[, 2], decreasing = TRUE)

# Review top 10 neg words
all_tdm_m[order_by_neg, ] %>% head(n=10)
##               Docs
## Terms          positive    negative
##   condition           0 0.002159827
##   don´t               0 0.002159827
##   demand              0 0.001439885
##   disappointed        0 0.001439885
##   dumpsters           0 0.001439885
##   hygiene             0 0.001439885
##   inform              0 0.001439885
##   it´s                0 0.001439885
##   nasty               0 0.001439885
##   safety              0 0.001439885
# Get rid of non-alphanumeric (including weird punctuation)
delLines <- grepl(pattern="[^a-zA-Z\\d]", x=rownames(all_tdm_m))
sum(delLines)
## [1] 245
length(delLines)
## [1] 4967
# Comparison cloud
wordcloud::comparison.cloud(
  all_tdm_m[!delLines, ], 
  max.words = 20,
  colors = c("darkgreen","darkred")
)
## Warning in wordcloud::comparison.cloud(all_tdm_m[!delLines, ], max.words =
## 20, : sounds could not be fit on page. It will not be plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m[!delLines, ], max.words =
## 20, : speaking could not be fit on page. It will not be plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m[!delLines, ], max.words =
## 20, : unsafe could not be fit on page. It will not be plotted.

# Recall the "grade inflation" of polarity scores on the rental reviews?
# Sometimes, another way to uncover an insight is to scale the scores back to 0 then perform the corpus subset
# This means some of the previously positive comments may become part of the negative subsection or vice versa since the mean is changed to 0
# This exercise will help you scale the scores and then re-plot the comparison.cloud()
# Removing the "grade inflation" can help provide additional insights

# Previously you applied polarity() to the bos_reviews$comments and created a comparison.cloud()
# In this exercise you will scale() the outcome before creating the comparison.cloud()
# See if this shows something different in the visual!

# Review
bos_pol$all[1:6, 1:3]
##   all  wc  polarity
## 1 all  77 1.1851900
## 2 all  78 1.2455047
## 3 all  39 0.4803845
## 4 all 101 0.7562283
## 5 all  16 0.2500000
## 6 all  79 0.5625440
# Scale/center & append
bos_reviews$scaled_polarity <- scale(bos_pol$all$polarity)

# Subset positive comments
pos_comments <- subset(bos_reviews$comments, bos_reviews$scaled_polarity > 0)

# Subset negative comments
neg_comments <- subset(bos_reviews$comments, bos_reviews$scaled_polarity < 0)

# Paste and collapse the positive comments
pos_terms <- paste(pos_comments, collapse = " ")

# Paste and collapse the negative comments
neg_terms <- paste(neg_comments, collapse = " ")

# Organize
all_terms<- c(pos_terms, neg_terms)

# VCorpus
all_corpus <- VCorpus(VectorSource(all_terms))

# TDM
all_tdm <- TermDocumentMatrix(
  all_corpus, 
  control = list(
    weighting = weightTfIdf, 
    removePunctuation = TRUE, 
    stopwords = stopwords(kind = "en")
  )
)

# Column names
all_tdm_m <- as.matrix(all_tdm)
colnames(all_tdm_m) <- c("positive", "negative")

# Comparison cloud
wordcloud::comparison.cloud(
  all_tdm_m, 
  max.words = 40,
  colors = c("darkgreen", "darkred")
)
## Warning in wordcloud::comparison.cloud(all_tdm_m, max.words = 40, colors
## = c("darkgreen", : suggested could not be fit on page. It will not be
## plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m, max.words = 40, colors
## = c("darkgreen", : amazingly could not be fit on page. It will not be
## plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m, max.words = 40, colors
## = c("darkgreen", : appliances could not be fit on page. It will not be
## plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m, max.words = 40, colors
## = c("darkgreen", : luxurious could not be fit on page. It will not be
## plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m, max.words = 40, colors =
## c("darkgreen", : meetings could not be fit on page. It will not be plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m, max.words = 40, colors
## = c("darkgreen", : mentioned could not be fit on page. It will not be
## plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m, max.words = 40, colors
## = c("darkgreen", : particularly could not be fit on page. It will not be
## plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m, max.words = 40, colors =
## c("darkgreen", : phyllis could not be fit on page. It will not be plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m, max.words = 40, colors =
## c("darkgreen", : spaces could not be fit on page. It will not be plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m, max.words = 40, colors =
## c("darkgreen", : unique could not be fit on page. It will not be plotted.


Sentiment Analysis in R: The Tidy Way

Chapter 1 - Tweets Across the United States

Sentiment analysis and tidy tools:

  • Can apply the tidyverse to text mining and sentiment analysis
  • Sentiment lexicons are lists of words that have been scored by content of the word - binary, numeric, etc.
    • NRC - specific emotions
    • Bing -
    • Afinn -
  • Decisions about which lexicon to use typically depend on what questions you want to answer

Sentiment analysis via inner join:

  • A pre-processed dataset, geocoded_tweets, is available containing three columns
    • state (US state)
    • word
    • freq (average frequency for the word within the state)
  • Tidy data with one word per row can easily be conveted to sentiment analysis using an inner_join()

Using dplyr verbs to analysis sentiment analysis results:

  • The filter() verb is useful for just pulling specific rows
  • The group_by() verb is useful for defining groups within the dataset based on values of a specific variable(s)
  • The summarize() verb acts to create one value per group, based on functions specified inside the call
  • The arrange() verb will organize the results by a specific variable(s) - basically, a form of sort()
  • The ungroup() function will remove the groups from a data frame, which is often a valuable step after the summarizing is completed

Looking at differences by state:

  • Can filter by state and sentiment to see a sample of words used by state and sentiment
  • Can group by state and get summaries for all the states simultaneously
  • The spread() function converts from a long, skinny date frame to a short, wide data frame
    • Can run group_by() %>% summarize() %>% spread() %>% ungroup()

Example code includes:

# Choose the bing lexicon
tidytext::get_sentiments("bing")
## # A tibble: 6,788 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faced     negative 
##  2 2-faces     negative 
##  3 a+          positive 
##  4 abnormal    negative 
##  5 abolish     negative 
##  6 abominable  negative 
##  7 abominably  negative 
##  8 abominate   negative 
##  9 abomination negative 
## 10 abort       negative 
## # ... with 6,778 more rows
# Choose the nrc lexicon
tidytext::get_sentiments("nrc") %>%
  count(sentiment) # Count words by sentiment
## # A tibble: 10 x 2
##    sentiment        n
##    <chr>        <int>
##  1 anger         1247
##  2 anticipation   839
##  3 disgust       1058
##  4 fear          1476
##  5 joy            689
##  6 negative      3324
##  7 positive      2312
##  8 sadness       1191
##  9 surprise       534
## 10 trust         1231
# geocoded_tweets has been pre-defined
load("./RInputFiles/geocoded_tweets.rda")
geocoded_tweets
## # A tibble: 520,304 x 3
##    state   word          freq
##    <chr>   <chr>        <dbl>
##  1 alabama a         16256686
##  2 alabama a-            5491
##  3 alabama a-day         3992
##  4 alabama aa            4739
##  5 alabama aaliyah       8252
##  6 alabama aamu          4306
##  7 alabama aaron        19813
##  8 alabama ab           68032
##  9 alabama abandoned     4071
## 10 alabama abbeville     7153
## # ... with 520,294 more rows
# Access bing lexicon: bing
bing <- tidytext::get_sentiments("bing")

# Use data frame with text data
geocoded_tweets %>%
  # With inner join, implement sentiment analysis using `bing`
  inner_join(bing, by=c("word"))
## # A tibble: 64,303 x 4
##    state   word            freq sentiment
##    <chr>   <chr>          <dbl> <chr>    
##  1 alabama abuse           7186 negative 
##  2 alabama abused          3073 negative 
##  3 alabama accomplish      5957 positive 
##  4 alabama accomplished   13121 positive 
##  5 alabama accomplishment  3036 positive 
##  6 alabama accurate       28262 positive 
##  7 alabama ache            7306 negative 
##  8 alabama aching          5080 negative 
##  9 alabama addict          5441 negative 
## 10 alabama addicted       40389 negative 
## # ... with 64,293 more rows
# Create the tweets_nrc data
nrc <- tidytext::get_sentiments("nrc")
tweets_nrc <- geocoded_tweets %>%
  inner_join(nrc, by=c("word"))


# tweets_nrc has been pre-defined
tweets_nrc
## # A tibble: 210,027 x 4
##    state   word       freq sentiment
##    <chr>   <chr>     <dbl> <chr>    
##  1 alabama abandoned  4071 anger    
##  2 alabama abandoned  4071 fear     
##  3 alabama abandoned  4071 negative 
##  4 alabama abandoned  4071 sadness  
##  5 alabama ability   12406 positive 
##  6 alabama abortion   3267 disgust  
##  7 alabama abortion   3267 fear     
##  8 alabama abortion   3267 negative 
##  9 alabama abortion   3267 sadness  
## 10 alabama absolute  22956 positive 
## # ... with 210,017 more rows
tweets_nrc %>%
  # Filter to only choose the words associated with sadness
  filter(sentiment == "sadness") %>%
  # Group by word
  group_by(word) %>%
  # Use the summarize verb to find the mean frequency
  summarize(freq = mean(freq)) %>%
  # Arrange to sort in order of descending frequency
  arrange(desc(freq))
## # A tibble: 585 x 2
##    word       freq
##    <chr>     <dbl>
##  1 hate    1253840
##  2 bad      984943
##  3 bitch    787774
##  4 hell     486259
##  5 crazy    447047
##  6 feeling  407562
##  7 leave    397806
##  8 mad      393559
##  9 music    373608
## 10 sick     362023
## # ... with 575 more rows
#  (If you are familiar with geom_bar(stat = "identity"), geom_col() does the same thing.)
# tweets_nrc has been pre-defined
# tweets_nrc

joy_words <- tweets_nrc %>%
  # Filter to choose only words associated with joy
  filter(sentiment == "joy") %>%
  # Group by each word
  group_by(word) %>%
  # Use the summarize verb to find the mean frequency
  summarize(freq = mean(freq)) %>%
  # Arrange to sort in order of descending frequency
  arrange(desc(freq))    

joy_words %>%
  top_n(20) %>%
  mutate(word = reorder(word, freq)) %>%
  # Use aes() to put words on the x-axis and frequency on the y-axis
  ggplot(aes(x=word, y=freq)) +
  # Make a bar chart with geom_col()
  geom_col() +
  coord_flip() 
## Selecting by freq

# tweets_nrc has been pre-defined
# tweets_nrc

tweets_nrc %>%
  # Find only the words for the state of Utah and associated with joy
  filter(state == "utah", sentiment == "joy") %>%
  # Arrange to sort in order of descending frequency
  arrange(desc(freq))
## # A tibble: 326 x 4
##    state word         freq sentiment
##    <chr> <chr>       <dbl> <chr>    
##  1 utah  love      4207322 joy      
##  2 utah  good      3035114 joy      
##  3 utah  happy     1402568 joy      
##  4 utah  pretty     902947 joy      
##  5 utah  fun        764045 joy      
##  6 utah  birthday   663439 joy      
##  7 utah  beautiful  653061 joy      
##  8 utah  friend     627522 joy      
##  9 utah  hope       571841 joy      
## 10 utah  god        536687 joy      
## # ... with 316 more rows
tweets_nrc %>%
  # Find only the words for the state of Louisiana and associated with joy
  filter(state == "louisiana", sentiment == "joy") %>%
  # Arrange to sort in order of descending frequency
  arrange(desc(freq))
## # A tibble: 290 x 4
##    state     word        freq sentiment
##    <chr>     <chr>      <dbl> <chr>    
##  1 louisiana love     3764157 joy      
##  2 louisiana good     2758699 joy      
##  3 louisiana baby     1184392 joy      
##  4 louisiana happy    1176291 joy      
##  5 louisiana god       882457 joy      
##  6 louisiana birthday  740497 joy      
##  7 louisiana money     677899 joy      
##  8 louisiana hope      675559 joy      
##  9 louisiana pretty    581242 joy      
## 10 louisiana feeling   486367 joy      
## # ... with 280 more rows
# For the last exercise in this chapter, you will determine how the overall sentiment of Twitter sentiment varies from state to state
# You will use a dataset called tweets_bing, which is the output of an inner join created just the same way that you did earlier
# Check out what tweets_bing looks like in the console


# Create the tweets_bing data
bing <- tidytext::get_sentiments("bing")
tweets_bing <- geocoded_tweets %>%
  inner_join(bing, by=c("word"))


# tweets_bing has been pre-defined
tweets_bing
## # A tibble: 64,303 x 4
##    state   word            freq sentiment
##    <chr>   <chr>          <dbl> <chr>    
##  1 alabama abuse           7186 negative 
##  2 alabama abused          3073 negative 
##  3 alabama accomplish      5957 positive 
##  4 alabama accomplished   13121 positive 
##  5 alabama accomplishment  3036 positive 
##  6 alabama accurate       28262 positive 
##  7 alabama ache            7306 negative 
##  8 alabama aching          5080 negative 
##  9 alabama addict          5441 negative 
## 10 alabama addicted       40389 negative 
## # ... with 64,293 more rows
tweets_bing %>% 
  # Group by two columns: state and sentiment
  group_by(state, sentiment) %>%
  # Use summarize to calculate the mean frequency for these groups
  summarize(freq = mean(freq)) %>%
  tidyr::spread(sentiment, freq) %>%
  ungroup() %>%
  # Calculate the ratio of positive to negative words
  mutate(ratio = positive / negative,
         state = reorder(state, ratio)) %>%
  # Use aes() to put state on the x-axis and ratio on the y-axis
  ggplot(aes(x=state, y=ratio)) +
  # Make a plot with points using geom_point()
  geom_point() +
  coord_flip()


Chapter 2 - Shakespeare Gets Sentimental

Tidying Shakespeare plays:

  • The shakespeare tibble is available with fields title-type-text
    • The text is in a fairly raw form (one row per line from the play), and requires some pre-processing to get to a usable form
  • Need to transform raw text in to a tidy format - one row per word (or word cluster as the case may be)
    • The tidytext::unnest_tokens(myOutputCol, myInputCol) # will tokenize myInputCol in to myOutputCol, considering each word to be a token
    • Defaults for unnest_tokens() are that punctuation will be removed, lower case will be created, and spaces/punctuations will be the word-splits

Using count and mutate:

  • May want to create rations rather than just counts of positive and negative
  • The dplyr::mutate() can calculate new variables, and they will be calculated by group if the group_by() is currently active
    • For example, a %>% group_by(b) %>% mutate(subTotal = sum(d)) # will sum d for each sub-group b
    • dfTemp <- data.frame(pet=rep(c(“cat”, “dog”), times=c(3, 4)), food=c(2:4, seq(1, 7, by=2)))
    • dfTemp %>% group_by(pet) %>% mutate(subFood = sum(food)) %>% ungroup() %>% mutate(totFood = sum(food)) # subFood is 9 for 3 cat records and 16 for 4 dog records while totFood is 25 for all 7 records

Sentiment contributions by individual words:

  • Sometimes a word will be misidentified by the lexicon, which can be confirmed through SME (e.g., wilt used to be will in the 16th century but now it probably means fading)
    • shakespeare %>% filter(str_detect(text, “wilt”)) %>% select(text) # identify lines using the word
    • tidy_shakespeare %>% anti_join(data_frame(word = “wilt”)) # remove word prior to sentiment analysis

Which words are important in each play?

  • Helpful to understand the most important words by sentiment and document; good for many purposes including sanity checks by SME
  • Tidy data makes sentiment analysis easier and more intuitive
    • For example, can calculate the net sentiment (positive vs. negative) by chunk of lines - evolution over the course of the text

Example code includes:

load("./RInputFiles/shakespeare.rda")

# The data set shakespeare in available in the workspace
shakespeare
## # A tibble: 25,888 x 3
##    title                           type    text                           
##    <chr>                           <chr>   <chr>                          
##  1 The Tragedy of Romeo and Juliet Tragedy The Complete Works of William ~
##  2 The Tragedy of Romeo and Juliet Tragedy ""                             
##  3 The Tragedy of Romeo and Juliet Tragedy The Tragedy of Romeo and Juliet
##  4 The Tragedy of Romeo and Juliet Tragedy ""                             
##  5 The Tragedy of Romeo and Juliet Tragedy The Library of the Future Comp~
##  6 The Tragedy of Romeo and Juliet Tragedy Library of the Future is a Tra~
##  7 The Tragedy of Romeo and Juliet Tragedy ""                             
##  8 The Tragedy of Romeo and Juliet Tragedy ""                             
##  9 The Tragedy of Romeo and Juliet Tragedy <<THIS ELECTRONIC VERSION OF T~
## 10 The Tragedy of Romeo and Juliet Tragedy SHAKESPEARE IS COPYRIGHT 1990-~
## # ... with 25,878 more rows
# Pipe the shakespeare data frame to the next line
shakespeare %>% 
  # Use count to find out how many titles/types there are
  count(title, type)
## # A tibble: 6 x 3
##   title                           type        n
##   <chr>                           <chr>   <int>
## 1 A Midsummer Night's Dream       Comedy   3459
## 2 Hamlet, Prince of Denmark       Tragedy  6776
## 3 Much Ado about Nothing          Comedy   3799
## 4 The Merchant of Venice          Comedy   4225
## 5 The Tragedy of Macbeth          Tragedy  3188
## 6 The Tragedy of Romeo and Juliet Tragedy  4441
tidy_shakespeare <- shakespeare %>%
  # Group by the titles of the plays
  group_by(title) %>%
  # Define a new column linenumber
  mutate(linenumber=row_number()) %>%
  # Transform the non-tidy text data to tidy text data
  tidytext::unnest_tokens(word, text) %>%
  ungroup()

# Pipe the tidy Shakespeare data frame to the next line
tidy_shakespeare %>% 
  # Use count to find out how many times each word is used
  count(word, sort = TRUE)
## # A tibble: 10,736 x 2
##    word      n
##    <chr> <int>
##  1 the    4651
##  2 and    4170
##  3 i      3296
##  4 to     3047
##  5 of     2645
##  6 a      2511
##  7 you    2287
##  8 my     1913
##  9 in     1836
## 10 that   1721
## # ... with 10,726 more rows
shakespeare_sentiment <- tidy_shakespeare %>%
  # Implement sentiment analysis with the "bing" lexicon
  inner_join(tidytext::get_sentiments("bing"), by=c("word")) 

shakespeare_sentiment %>%
  # Find how many positive/negative words each play has
  count(title, sentiment)
## # A tibble: 12 x 3
##    title                           sentiment     n
##    <chr>                           <chr>     <int>
##  1 A Midsummer Night's Dream       negative    681
##  2 A Midsummer Night's Dream       positive    773
##  3 Hamlet, Prince of Denmark       negative   1323
##  4 Hamlet, Prince of Denmark       positive   1223
##  5 Much Ado about Nothing          negative    767
##  6 Much Ado about Nothing          positive   1127
##  7 The Merchant of Venice          negative    740
##  8 The Merchant of Venice          positive    962
##  9 The Tragedy of Macbeth          negative    914
## 10 The Tragedy of Macbeth          positive    749
## 11 The Tragedy of Romeo and Juliet negative   1235
## 12 The Tragedy of Romeo and Juliet positive   1090
sentiment_counts <- tidy_shakespeare %>%
    # Implement sentiment analysis using the "bing" lexicon
    inner_join(tidytext::get_sentiments("bing"), by=c("word")) %>%
    # Count the number of words by title, type, and sentiment
    count(title, type, sentiment)

sentiment_counts %>%
    # Group by the titles of the plays
    group_by(title) %>%
    # Find the total number of words in each play
    mutate(total = sum(n),
    # Calculate the number of words divided by the total
           percent = n / total) %>%
    # Filter the results for only negative sentiment
    filter(sentiment == "negative") %>%
    arrange(percent)
## # A tibble: 6 x 6
## # Groups:   title [6]
##   title                           type    sentiment     n total percent
##   <chr>                           <chr>   <chr>     <int> <int>   <dbl>
## 1 Much Ado about Nothing          Comedy  negative    767  1894   0.405
## 2 The Merchant of Venice          Comedy  negative    740  1702   0.435
## 3 A Midsummer Night's Dream       Comedy  negative    681  1454   0.468
## 4 Hamlet, Prince of Denmark       Tragedy negative   1323  2546   0.520
## 5 The Tragedy of Romeo and Juliet Tragedy negative   1235  2325   0.531
## 6 The Tragedy of Macbeth          Tragedy negative    914  1663   0.550
# Notice what the line mutate(word = reorder(word, n)) does; it converts word from a character that would be plotted in alphabetical order to a factor that will be plotted in order of n
word_counts <- tidy_shakespeare %>%
  # Implement sentiment analysis using the "bing" lexicon
  inner_join(tidytext::get_sentiments("bing"), by=c("word")) %>%
  # Count by word and sentiment
  count(word, sentiment)

top_words <- word_counts %>%
  # Group by sentiment
  group_by(sentiment) %>%
  # Take the top 10 for each sentiment
  top_n(10) %>%
  ungroup() %>%
  # Make word a factor in order of n
  mutate(word = reorder(word, n))
## Selecting by n
# Use aes() to put words on the x-axis and n on the y-axis
ggplot(top_words, aes(x=word, y=n, fill = sentiment)) +
  # Make a bar chart with geom_col()
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free") +  
  coord_flip()

# Correct! The word “wilt” was used differently in Shakespeare's time and was not negative; the lexicon has misidentified it
# For example, from Romeo and Juliet, “For thou wilt lie upon the wings of night”
# It is important to explore the details of how words were scored when performing sentiment analyses


tidy_shakespeare %>%
  # Count by title and word
  count(title, word, sort = TRUE) %>%
  # Implement sentiment analysis using the "afinn" lexicon
  inner_join(tidytext::get_sentiments("afinn"), by=c("word")) %>%
  # Filter to only examine the scores for Macbeth that are negative
  filter(title == "The Tragedy of Macbeth", score < 0)
## # A tibble: 237 x 4
##    title                  word        n score
##    <chr>                  <chr>   <int> <int>
##  1 The Tragedy of Macbeth no         73    -1
##  2 The Tragedy of Macbeth fear       35    -2
##  3 The Tragedy of Macbeth death      20    -2
##  4 The Tragedy of Macbeth bloody     16    -3
##  5 The Tragedy of Macbeth poor       16    -2
##  6 The Tragedy of Macbeth strange    16    -1
##  7 The Tragedy of Macbeth dead       14    -3
##  8 The Tragedy of Macbeth leave      14    -1
##  9 The Tragedy of Macbeth fight      13    -1
## 10 The Tragedy of Macbeth charges    11    -2
## # ... with 227 more rows
sentiment_contributions <- tidy_shakespeare %>%
  # Count by title and word
  count(title, word, sort = TRUE) %>%
  # Implement sentiment analysis using the "afinn" lexicon
  inner_join(tidytext::get_sentiments("afinn"), by=c("word")) %>%
  # Group by title
  group_by(title) %>%
  # Calculate a contribution for each word in each title
  mutate(contribution = n * score / sum(n)) %>%
  ungroup()
    
sentiment_contributions
## # A tibble: 2,366 x 5
##    title                           word      n score contribution
##    <chr>                           <chr> <int> <int>        <dbl>
##  1 Hamlet, Prince of Denmark       no      143    -1      -0.0652
##  2 The Tragedy of Romeo and Juliet love    140     3       0.213 
##  3 Much Ado about Nothing          no      132    -1      -0.0768
##  4 Much Ado about Nothing          hero    114     2       0.133 
##  5 A Midsummer Night's Dream       love    110     3       0.270 
##  6 Hamlet, Prince of Denmark       good    109     3       0.149 
##  7 The Tragedy of Romeo and Juliet no      102    -1      -0.0518
##  8 Much Ado about Nothing          good     93     3       0.162 
##  9 The Merchant of Venice          no       92    -1      -0.0630
## 10 Much Ado about Nothing          love     91     3       0.159 
## # ... with 2,356 more rows
sentiment_contributions %>%
  # Filter for Hamlet
  filter(title == "Hamlet, Prince of Denmark") %>%
  # Arrange to see the most negative words
  arrange(contribution)
## # A tibble: 493 x 5
##    title                     word        n score contribution
##    <chr>                     <chr>   <int> <int>        <dbl>
##  1 Hamlet, Prince of Denmark no        143    -1      -0.0652
##  2 Hamlet, Prince of Denmark dead       33    -3      -0.0451
##  3 Hamlet, Prince of Denmark death      38    -2      -0.0347
##  4 Hamlet, Prince of Denmark madness    22    -3      -0.0301
##  5 Hamlet, Prince of Denmark mad        21    -3      -0.0287
##  6 Hamlet, Prince of Denmark fear       21    -2      -0.0192
##  7 Hamlet, Prince of Denmark poor       20    -2      -0.0182
##  8 Hamlet, Prince of Denmark hell       10    -4      -0.0182
##  9 Hamlet, Prince of Denmark grave      17    -2      -0.0155
## 10 Hamlet, Prince of Denmark ghost      32    -1      -0.0146
## # ... with 483 more rows
sentiment_contributions %>%
  # Filter for The Merchant of Venice
  filter(title == "The Merchant of Venice") %>%
  # Arrange to see the most positive words
  arrange(desc(contribution))
## # A tibble: 344 x 5
##    title                  word        n score contribution
##    <chr>                  <chr>   <int> <int>        <dbl>
##  1 The Merchant of Venice good       63     3       0.129 
##  2 The Merchant of Venice love       60     3       0.123 
##  3 The Merchant of Venice fair       35     2       0.0479
##  4 The Merchant of Venice like       34     2       0.0466
##  5 The Merchant of Venice true       24     2       0.0329
##  6 The Merchant of Venice sweet      23     2       0.0315
##  7 The Merchant of Venice pray       42     1       0.0288
##  8 The Merchant of Venice better     21     2       0.0288
##  9 The Merchant of Venice justice    17     2       0.0233
## 10 The Merchant of Venice welcome    17     2       0.0233
## # ... with 334 more rows
# After these lines of code, you will have the number of positive and negative words used in each index-ed section of the play
# These sections will be 70 lines long in your analysis here
# You want a chunk of text that is not too small (because then the sentiment changes will be very noisy) and not too big (because then you will not be able to see plot structure)
# In an analysis of this type you may need to experiment with what size chunks to make; sections of 70 lines works well for these plays

tidy_shakespeare %>%
  # Implement sentiment analysis using "bing" lexicon
  inner_join(tidytext::get_sentiments("bing"), by=c("word")) %>%
  # Count using four arguments
  count(title, type, index = linenumber %/% 70, sentiment)
## # A tibble: 744 x 5
##    title                     type   index sentiment     n
##    <chr>                     <chr>  <dbl> <chr>     <int>
##  1 A Midsummer Night's Dream Comedy  0    negative      4
##  2 A Midsummer Night's Dream Comedy  0    positive     11
##  3 A Midsummer Night's Dream Comedy  1.00 negative      7
##  4 A Midsummer Night's Dream Comedy  1.00 positive     19
##  5 A Midsummer Night's Dream Comedy  2.00 negative     20
##  6 A Midsummer Night's Dream Comedy  2.00 positive     23
##  7 A Midsummer Night's Dream Comedy  3.00 negative     12
##  8 A Midsummer Night's Dream Comedy  3.00 positive     18
##  9 A Midsummer Night's Dream Comedy  4.00 negative      9
## 10 A Midsummer Night's Dream Comedy  4.00 positive     27
## # ... with 734 more rows
tidy_shakespeare %>%
  inner_join(tidytext::get_sentiments("bing")) %>%
  count(title, type, index = linenumber %/% 70, sentiment) %>%
  # Spread sentiment and n across multiple columns
  tidyr::spread(sentiment, n, fill = 0) %>%
  # Use mutate to find net sentiment
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## # A tibble: 373 x 6
##    title                     type   index negative positive sentiment
##    <chr>                     <chr>  <dbl>    <dbl>    <dbl>     <dbl>
##  1 A Midsummer Night's Dream Comedy  0        4.00    11.0       7.00
##  2 A Midsummer Night's Dream Comedy  1.00     7.00    19.0      12.0 
##  3 A Midsummer Night's Dream Comedy  2.00    20.0     23.0       3.00
##  4 A Midsummer Night's Dream Comedy  3.00    12.0     18.0       6.00
##  5 A Midsummer Night's Dream Comedy  4.00     9.00    27.0      18.0 
##  6 A Midsummer Night's Dream Comedy  5.00    11.0     21.0      10.0 
##  7 A Midsummer Night's Dream Comedy  6.00    12.0     16.0       4.00
##  8 A Midsummer Night's Dream Comedy  7.00     9.00     6.00    - 3.00
##  9 A Midsummer Night's Dream Comedy  8.00     6.00    12.0       6.00
## 10 A Midsummer Night's Dream Comedy  9.00    19.0     12.0     - 7.00
## # ... with 363 more rows
tidy_shakespeare %>%
  inner_join(tidytext::get_sentiments("bing")) %>%
  count(title, type, index = linenumber %/% 70, sentiment) %>%
  tidyr::spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  # Put index on x-axis, sentiment on y-axis, and map comedy/tragedy to fill
  ggplot(aes(x=index, y=sentiment, fill=type)) +
  # Make a bar chart with geom_col()
  geom_col() +
  # Separate panels for each title with facet_wrap()
  facet_wrap(~ title, scales = "free_x")
## Joining, by = "word"


Chapter 3 - Analyzing TV News

That’s the way it is:

  • Spoken text can be grabbed from TV shows based on archived closed captioning
  • Snippets spoken about “climate change” over 7 years
    • The climate_text tibble has station-show-show_date-text
    • Text is initially fairly raw; unnest_tokens() will be needed

Comparing TV stations:

  • Data from CNN, Fox, MSNBC - multiple shows that mentioned “climate change” over 7 years
  • May want to compare proportions of negative words used by each station
    • When running count(), can use an integer that is constant by the other groups as one of the “by” variables, to keep the value of that integer for future analysis

Sentiment changes with time:

  • Desire to look at changes in sentiment over time (time is a variable included in the initial tibble)
  • Can use lubridate::floor_date() to convert every time to the next-lowest increment of units
    • floor_date(x, unit=“3 months”) # will convert x to the start of the quarter

Example code includes:

# Take a look at the dataset of TV news text about climate change you will use in this chapter. The climate_text dataset contains almost 600 closed captioning snippets and four columns
# station, the TV news station where the text is from,
# show, the show on that station where the text was spoken,
# show_date, the broadcast date of the spoken text, and
# text, the actual text spoken on TV
# Type climate_text in the console to take a look at the dataset before getting started with transforming it to a tidy format.


load("./RInputFiles/climate_text.rda")
climate_text
## # A tibble: 593 x 4
##    station show                                 show_date           text  
##    <chr>   <chr>                                <dttm>              <chr> 
##  1 MSNBC   Morning Meeting                      2009-09-22 13:00:00 the i~
##  2 MSNBC   Morning Meeting                      2009-10-23 13:00:00 corpo~
##  3 CNN     CNN Newsroom                         2009-12-03 20:00:00 he sa~
##  4 CNN     American Morning                     2009-12-07 11:00:00 espec~
##  5 MSNBC   Morning Meeting                      2009-12-08 14:00:00 lots ~
##  6 MSNBC   Countdown With Keith Olbermann       2009-12-10 06:00:00 so th~
##  7 CNN     Sanjay Gupta MD                      2009-12-12 12:30:00 let m~
##  8 CNN     The Situation Room With Wolf Blitzer 2009-12-16 21:00:00 other~
##  9 MSNBC   Countdown With Keith Olbermann       2009-12-19 01:00:00 let d~
## 10 MSNBC   The Rachel Maddow Show               2010-01-08 04:00:00 you k~
## # ... with 583 more rows
data(stop_words, package="tidytext")
stop_words
## # A tibble: 1,149 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART  
## # ... with 1,139 more rows
# Pipe the climate_text dataset to the next line
tidy_tv <- climate_text %>%
    # Transform the non-tidy text data to tidy text data
    tidytext::unnest_tokens(word, text)


tidy_tv %>% 
    anti_join(stop_words) %>%
    # Count by word with sort = TRUE
    count(word, sort=TRUE)
## Joining, by = "word"
## # A tibble: 3,699 x 2
##    word          n
##    <chr>     <int>
##  1 climate    1627
##  2 change     1615
##  3 people      139
##  4 real        125
##  5 president   112
##  6 global      107
##  7 issue        87
##  8 trump        86
##  9 warming      85
## 10 issues       69
## # ... with 3,689 more rows
tidy_tv %>%
    # Count by station
    count(station) %>%
    # Rename the new column station_total
    rename(station_total = n)
## # A tibble: 3 x 2
##   station  station_total
##   <chr>            <int>
## 1 CNN              10713
## 2 FOX News         10876
## 3 MSNBC            19487
tv_sentiment <- tidy_tv %>% 
    # Group by station
    group_by(station) %>% 
    # Define a new column station_total
    mutate(station_total = n()) %>%
    ungroup() %>%
    # Implement sentiment analysis with the NRC lexicon
    inner_join(tidytext::get_sentiments("nrc"), by=c("word"))


# Which stations use the most negative words?
tv_sentiment %>% 
    count(station, sentiment, station_total) %>%
    # Define a new column percent
    mutate(percent = n / station_total) %>%
    # Filter only for negative words
    filter(sentiment == "negative") %>%
    # Arrange by percent
    arrange(percent)
## # A tibble: 3 x 5
##   station  sentiment station_total     n percent
##   <chr>    <chr>             <int> <int>   <dbl>
## 1 MSNBC    negative          19487   526  0.0270
## 2 CNN      negative          10713   331  0.0309
## 3 FOX News negative          10876   403  0.0371
# Now do the same but for positive words
tv_sentiment %>% 
    count(station, sentiment, station_total) %>%
    mutate(percent = n / station_total) %>%
    filter(sentiment == "positive") %>%
    arrange(percent)
## # A tibble: 3 x 5
##   station  sentiment station_total     n percent
##   <chr>    <chr>             <int> <int>   <dbl>
## 1 FOX News positive          10876   514  0.0473
## 2 CNN      positive          10713   522  0.0487
## 3 MSNBC    positive          19487   953  0.0489
tv_sentiment %>%
    # Count by word and sentiment
    count(word, sentiment) %>%
    # Group by sentiment
    group_by(sentiment) %>%
    # Take the top 10 words for each sentiment
    top_n(10) %>%
    ungroup() %>%
    mutate(word = reorder(word, n)) %>%
    # Set up the plot with aes()
    ggplot(aes(x=word, y=n, fill=sentiment)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~ sentiment, scales = "free") +
    coord_flip()
## Selecting by n

# Excellent!
# Notice that you see proper names like Gore and Trump, which should be treated as neutral, and that “change” was a strong driver of fear sentiment, even though it is by definition part of these texts on climate change
# It is important to see which words contribute to your sentiment scores so you can adjust the sentiment lexicons if appropriate

tv_sentiment %>%
    # Filter for only negative words
    filter(sentiment == "negative") %>%
    # Count by word and station
    count(word, station) %>%
    # Group by station
    group_by(station) %>%
    # Take the top 10 words for each station
    top_n(10) %>%
    ungroup() %>%
    mutate(word = reorder(paste(word, station, sep = "__"), n)) %>%
    # Set up the plot with aes()
    ggplot(aes(x=word, y=n, fill=station)) +
    geom_col(show.legend = FALSE) +
    scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
    facet_wrap(~ station, nrow = 2, scales = "free") +
    coord_flip()
## Selecting by n

sentiment_by_time <- tidy_tv %>%
    # Define a new column using floor_date()
    mutate(date = lubridate::floor_date(show_date, unit = "6 months")) %>%
    # Group by date
    group_by(date) %>%
    mutate(total_words = n()) %>%
    ungroup() %>%
    # Implement sentiment analysis using the NRC lexicon
    inner_join(tidytext::get_sentiments("nrc"), by=c("word"))

sentiment_by_time %>%
    # Filter for positive and negative words
    filter(sentiment == "positive" | sentiment == "negative") %>%
    # Count by date, sentiment, and total_words
    count(date, sentiment, total_words) %>%
    ungroup() %>%
    mutate(percent = n / total_words) %>%
    # Set up the plot with aes()
    ggplot(aes(x=date, y=percent, color=sentiment)) +
    geom_line(size = 1.5) +
    geom_smooth(method = "lm", se = FALSE, lty = 2) +
    expand_limits(y = 0)

tidy_tv %>%
    # Define a new column that rounds each date to the nearest 1 month
    mutate(date = lubridate::floor_date(show_date, unit="1 months")) %>%
    filter(word %in% c("threat", "hoax", "denier",
                       "real", "warming", "hurricane")) %>%
    # Count by date and word
    count(date, word) %>%
    ungroup() %>%
    # Set up your plot with aes()
    ggplot(aes(x=date, y=n, color=word)) +
    # Make facets by word
    facet_wrap(~ word) +
    geom_line(size = 1.5, show.legend = FALSE) +
    expand_limits(y = 0)


Chapter 4 - Singing a Happy Song

Ranking pop songs through the years:

  • Texts are significantly different from each other, but the tidy approach is flexible to adapt to this
  • The tibble song_lyrics is available, and contains rank-song-artist-year-lyrics
    • Will need to tidy using unnest_tokens()
    • Can run count(song, sentiment, total_words)

Connecting sentiment to other quantities:

  • Other information is available about the songs, allowing for looking at sentiment by attribute
    • Billboard rank, year of release (1965 - 2015), etc.
  • Can look at boxplots for distributions of interest

Moving from song rank to year:

  • Sentiment content of pop songs by decade - count(song, year, total_words)
  • Use mutate() to define proportions of sentiment words
  • Plot as a box plot
  • Build the linear model for sentiment vs. year using lm()

Wrap up:

  • Text Mining with R is available at tidytextmining.com

Example code includes:

# Let's take a look at the dataset you will use in this final chapter to practice your sentiment analysis skills
# The song_lyrics dataset contains five columns
# rank, the rank a song achieved on the Billboard Year-End Hot 100,
# song, the song's title,
# artist, the artist who recorded the song,
# year, the year the song reached the given rank on the Billboard chart, and
# lyrics, the lyrics of the song

# This dataset contains over 5000 songs, from 1965 to the present
# The lyrics are all in one column, so they are not yet in a tidy format, ready for analysis using tidy tools
# It's your turn to tidy this text data!


load("./RInputFiles/song_lyrics.rda")
song_lyrics
## # A tibble: 4,831 x 5
##     rank song                                     artist   year lyrics    
##    <int> <chr>                                    <chr>   <int> <chr>     
##  1     1 wooly bully                              sam th~  1965 sam the s~
##  2     2 i cant help myself sugar pie honey bunch four t~  1965 sugar pie~
##  3     4 you were on my mind                      we five  1965 when i wo~
##  4     5 youve lost that lovin feelin             the ri~  1965 you never~
##  5     6 downtown                                 petula~  1965 when your~
##  6     7 help                                     the be~  1965 help i ne~
##  7     8 cant you hear my heart beat              herman~  1965 carterlew~
##  8     9 crying in the chapel                     elvis ~  1965 you saw m~
##  9    10 my girl                                  the te~  1965 ive got s~
## 10    11 help me rhonda                           the be~  1965 well sinc~
## # ... with 4,821 more rows
data(stop_words, package="tidytext")
stop_words
## # A tibble: 1,149 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART  
## # ... with 1,139 more rows
# Pipe song_lyrics to the next line
tidy_lyrics <- song_lyrics %>% 
  # Transform the lyrics column to a word column
  tidytext::unnest_tokens(word, lyrics)

# Print tidy_lyrics
tidy_lyrics
## # A tibble: 1,602,879 x 5
##     rank song        artist                         year word         
##    <int> <chr>       <chr>                         <int> <chr>        
##  1     1 wooly bully sam the sham and the pharaohs  1965 sam          
##  2     1 wooly bully sam the sham and the pharaohs  1965 the          
##  3     1 wooly bully sam the sham and the pharaohs  1965 sham         
##  4     1 wooly bully sam the sham and the pharaohs  1965 miscellaneous
##  5     1 wooly bully sam the sham and the pharaohs  1965 wooly        
##  6     1 wooly bully sam the sham and the pharaohs  1965 bully        
##  7     1 wooly bully sam the sham and the pharaohs  1965 wooly        
##  8     1 wooly bully sam the sham and the pharaohs  1965 bully        
##  9     1 wooly bully sam the sham and the pharaohs  1965 sam          
## 10     1 wooly bully sam the sham and the pharaohs  1965 the          
## # ... with 1,602,869 more rows
# For some next steps in this analysis, you need to know the total number of words sung in each song
# Use count() to count up the words per song, and then left_join() these word totals to the tidy data set
# You can specify exactly which column to use when joining the two data frames if you add by = "song"

totals <- tidy_lyrics %>%
  # Count by song to find the word totals for each song
  count(song) %>%
  # Rename the new column
  rename(total_words = n)

# Print totals    
totals
## # A tibble: 4,341 x 2
##    song                   total_words
##    <chr>                        <int>
##  1 0 to 100  the catch up         894
##  2 1 2 3 4 sumpin new             670
##  3 1 2 3 red light                145
##  4 1 2 step                       437
##  5 1 thing                        532
##  6 100 pure love                  590
##  7 100 years                      257
##  8 123                            220
##  9 18 and life                    285
## 10 19 somethin                    281
## # ... with 4,331 more rows
lyric_counts <- tidy_lyrics %>%
  # Combine totals with tidy_lyrics using the "song" column
  left_join(totals, by = c("song"))


# You have been practicing how to implement sentiment analysis with a join throughout this course
# After transforming the text of these songs to a tidy text dataset and preparing the data frame, the resulting data frame lyric_counts is ready for you to perform sentiment analysis once again
# Once you have done the sentiment analysis, you can learn which songs have the most sentiment words from the NRC lexicon
# Remember that the NRC lexicon has 10 categories of sentiment:
# anger
# anticipation
# disgust
# fear
# joy
# negative
# positive
# sadness
# surprise
# trust


lyric_sentiment <- lyric_counts %>%
    # Implement sentiment analysis with the "nrc" lexicon
    inner_join(tidytext::get_sentiments("nrc"), by=c("word"))

lyric_sentiment %>%
    # Find how many sentiment words each song has
    count(song, sentiment, sort = TRUE)
## # A tibble: 39,564 x 3
##    song           sentiment     n
##    <chr>          <chr>     <int>
##  1 baby           positive    264
##  2 baby           joy         255
##  3 real love      positive    213
##  4 angel          positive    193
##  5 disturbia      negative    182
##  6 live your life positive    174
##  7 my love        positive    173
##  8 angel          joy         164
##  9 damn           negative    164
## 10 disturbia      sadness     164
## # ... with 39,554 more rows
# What songs have the highest proportion of negative words?
lyric_sentiment %>%
    # Count using three arguments
    count(song, sentiment, total_words) %>%
    ungroup() %>%
    # Make a new percent column with mutate 
    mutate(percent = n / total_words) %>%
    # Filter for only negative words
    filter(sentiment == "negative") %>%
    # Arrange by descending percent
    arrange(desc(percent))
## # A tibble: 4,169 x 5
##    song                           sentiment total_words     n percent
##    <chr>                          <chr>           <int> <int>   <dbl>
##  1 bad boy                        negative          237    77   0.325
##  2 rack city                      negative          458   142   0.310
##  3 ill tumble 4 ya                negative          269    79   0.294
##  4 time wont let me               negative          154    42   0.273
##  5 bang bang my baby shot me down negative          163    40   0.245
##  6 the stroke                     negative          279    57   0.204
##  7 the wild boys                  negative          245    49   0.200
##  8 pop that thang                 negative          204    40   0.196
##  9 disturbia                      negative          956   182   0.190
## 10 dance a                        negative          407    72   0.177
## # ... with 4,159 more rows
# What songs have the highest proportion of positive words?
lyric_sentiment %>%
    count(song, sentiment, total_words) %>%
    ungroup() %>%
    mutate(percent = n / total_words) %>%
    filter(sentiment == "positive") %>%
    arrange(desc(percent))
## # A tibble: 4,295 x 5
##    song                                sentiment total_words     n percent
##    <chr>                               <chr>           <int> <int>   <dbl>
##  1 love to love you baby               positive          240    78   0.325
##  2 dance dance dance yowsah yowsah yo~ positive          305    94   0.308
##  3 i got the feelin                    positive          141    35   0.248
##  4 i love music                        positive          252    61   0.242
##  5 sweet and innocent                  positive          218    51   0.234
##  6 me and baby brother                 positive          181    42   0.232
##  7 love hangover                       positive          173    40   0.231
##  8 sweet cream ladies                  positive          179    41   0.229
##  9 mighty love                         positive          482   110   0.228
## 10 keep feeling fascination            positive          189    43   0.228
## # ... with 4,285 more rows
# The lyric_sentiment data frame that you created earlier by using inner_join() is available in your environment
# You can now explore how the sentiment score of a song is related to other aspects of that song
# First, start with Billboard rank, how high on the annual Billboard chart the song reached
# Do songs that use more positive or negative words achieve higher or lower ranks?
# Start with positive words, and make a visualization to see how these characteristics are related

lyric_sentiment %>%
    filter(sentiment == "positive") %>%
    # Count by song, Billboard rank, and the total number of words
    count(song, rank, total_words) %>%
    ungroup() %>%
    # Use the correct dplyr verb to make two new columns
    mutate(percent = n / total_words,
           rank = 10 * floor(rank / 10)) %>%
    ggplot(aes(as.factor(rank), percent)) +
    # Make a boxplot
    geom_boxplot()

lyric_sentiment %>%
    # Filter for only negative words
    filter(sentiment == "negative") %>%
    # Count by song, Billboard rank, and the total number of words
    count(song, rank, total_words) %>%
    ungroup() %>%
    # Mutate to make a percent column
    mutate(percent = n / total_words,
           rank = 10 * floor(rank / 10)) %>%
    # Use ggplot to set up a plot with rank and percent
    ggplot(aes(x=as.factor(rank), y=percent)) +
    # Make a boxplot
    geom_boxplot()

# How is negative sentiment changing over time?
lyric_sentiment %>%
    # Filter for only negative words
    filter(sentiment == "negative") %>%
    # Count by song, year, and the total number of words
    count(song, year, total_words) %>%
    ungroup() %>%
    mutate(percent = n / total_words,
           year = 10 * floor(year / 10)) %>%
    # Use ggplot to set up a plot with year and percent
    ggplot(aes(x=as.factor(year), y=percent)) +
    geom_boxplot()

# How is positive sentiment changing over time?
lyric_sentiment %>%
    filter(sentiment == "positive") %>%
    count(song, year, total_words) %>%
    ungroup() %>%
    mutate(percent = n / total_words,
           year = 10 * floor(year / 10)) %>%
    ggplot(aes(x=as.factor(year), y=percent)) +
    geom_boxplot()

negative_by_year <- lyric_sentiment %>%
    # Filter for negative words
    filter(sentiment == "negative") %>%
    count(song, year, total_words) %>%
    ungroup() %>%
    # Define a new column: percent
    mutate(percent = n/total_words)

# Specify the model with percent as the response and year as the predictor
model_negative <- lm(percent ~ year, data = negative_by_year)

# Use summary to see the results of the model fitting
summary(model_negative)
## 
## Call:
## lm(formula = percent ~ year, data = negative_by_year)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.030288 -0.017205 -0.005778  0.010505  0.294194 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.809e-02  5.022e-02   0.758    0.448
## year        -3.720e-06  2.523e-05  -0.147    0.883
## 
## Residual standard error: 0.02513 on 4624 degrees of freedom
## Multiple R-squared:  4.702e-06,  Adjusted R-squared:  -0.0002116 
## F-statistic: 0.02174 on 1 and 4624 DF,  p-value: 0.8828
positive_by_year <- lyric_sentiment %>%
    filter(sentiment == "positive") %>%
    # Count by song, year, and total number of words
    count(song, year, total_words) %>%
    ungroup() %>%
    # Define a new column: percent
    mutate(percent = n/total_words)

# Fit a linear model with percent as the response and year as the predictor
model_positive <- lm(percent ~ year, data=positive_by_year)

# Use summary to see the results of the model fitting
summary(model_positive)
## 
## Call:
## lm(formula = percent ~ year, data = positive_by_year)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.058050 -0.024032 -0.007756  0.014774  0.269726 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.117e+00  6.859e-02   16.29   <2e-16 ***
## year        -5.373e-04  3.446e-05  -15.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03495 on 4770 degrees of freedom
## Multiple R-squared:  0.0485, Adjusted R-squared:  0.0483 
## F-statistic: 243.1 on 1 and 4770 DF,  p-value: < 2.2e-16

Supervised Learning R: Regression

Chapter 1 - What is Regression?

Introduction - for this course, regression will be about getting a numerical (rather than categorical) prediction:

  • From a scientific mindset, regression is about understanding the data generation process
  • From an engineering mindset, regression is about modeling to make good predictions
    • Machine learning (and this course) take the engineering mindset

Linear regression - fundamental method:

  • Simple regresion assumes an additive and linear model
    • Fit using lm(y ~ x, data=)
    • Can convert a character string to a formula using as.formula()
  • Can see the model with print() and a summary of diagnostics with summary()
    • The broom::glance() also gives a nice look at key statistics and coefficients

Predicting once you fit a model:

  • If model myLM exists, predict(myLM) will give the fitted values
    • predict(myLM, newdata=) will apply the model to new data (must have the same variables)

Wrap up for simple linear regression:

  • Advantages - easy to fit, concise, less prone to overfitting (test and train accuracy is similar), relatively interpretible
  • Disadvantages - cannot find complex, non-additive, non-linear relationships
    • Also can have problems with collinearity, leading to non-sensical coefficients
    • Model prediction accuracy is generally fine with collinearity, but double-check with the test data set

Example code includes:

unemployment <- readRDS("./RInputFiles/unemployment.rds")
bloodpressure <- readRDS("./RInputFiles/bloodpressure.rds")


# The data frame unemployment is in your workspace
# unemployment is loaded in the workspace
summary(unemployment)
##  male_unemployment female_unemployment
##  Min.   :2.900     Min.   :4.000      
##  1st Qu.:4.900     1st Qu.:4.400      
##  Median :6.000     Median :5.200      
##  Mean   :5.954     Mean   :5.569      
##  3rd Qu.:6.700     3rd Qu.:6.100      
##  Max.   :9.800     Max.   :7.900
# Define a formula to express female_unemployment as a function of male_unemployment
fmla <- female_unemployment ~ male_unemployment

# Print it
fmla
## female_unemployment ~ male_unemployment
# Use the formula to fit a model: unemployment_model
unemployment_model <- lm(fmla, data=unemployment)

# Print it
unemployment_model
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Coefficients:
##       (Intercept)  male_unemployment  
##            1.4341             0.6945
# There are a variety of different ways to examine a model; each way provides different information
# We will use summary(), broom::glance(), and sigr::wrapFTest()
# broom and sigr are already loaded in your workspace
# Print unemployment_model
unemployment_model
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Coefficients:
##       (Intercept)  male_unemployment  
##            1.4341             0.6945
# Call summary() on unemployment_model to get more details
summary(unemployment_model)
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05
# Call glance() on unemployment_model to see the details in a tidier form
broom::glance(unemployment_model)
##   r.squared adj.r.squared     sigma statistic      p.value df    logLik
## 1 0.8213157     0.8050716 0.5802596  50.56108 1.965985e-05  2 -10.28471
##        AIC      BIC deviance df.residual
## 1 26.56943 28.26428 3.703714          11
# Call wrapFTest() on unemployment_model to see the most relevant details
sigr::wrapFTest(unemployment_model)
## [1] "F Test summary: (R2=0.821, F(1,11)=50.6, p=2e-05)."
# The objects unemployment, unemployment_model and newrates are in your workspace
newrates <- data.frame(male_unemployment=5)

# newrates is in your workspace
newrates
##   male_unemployment
## 1                 5
# Predict female unemployment in the unemployment data set
unemployment$prediction <-  predict(unemployment_model)

# Make a plot to compare predictions to actual (prediction on x axis). 
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) + 
  geom_point() +
  geom_abline(color = "blue")

# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model, newdata=newrates)
# Print it
pred
##        1 
## 4.906757
# In this exercise, you will work with the blood pressure dataset (Source), and model blood_pressure as a function of weight and age.
# The data frame bloodpressure is in the workspace

# bloodpressure is in the workspace
summary(bloodpressure)
##  blood_pressure       age            weight   
##  Min.   :128.0   Min.   :46.00   Min.   :167  
##  1st Qu.:140.0   1st Qu.:56.50   1st Qu.:186  
##  Median :153.0   Median :64.00   Median :194  
##  Mean   :150.1   Mean   :62.45   Mean   :195  
##  3rd Qu.:160.5   3rd Qu.:69.50   3rd Qu.:209  
##  Max.   :168.0   Max.   :74.00   Max.   :220
# Create the formula and print it
fmla <- blood_pressure ~ age + weight
fmla
## blood_pressure ~ age + weight
# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla, data=bloodpressure)

# Print bloodpressure_model and call summary() 
bloodpressure_model
## 
## Call:
## lm(formula = fmla, data = bloodpressure)
## 
## Coefficients:
## (Intercept)          age       weight  
##     30.9941       0.8614       0.3349
summary(bloodpressure_model)
## 
## Call:
## lm(formula = fmla, data = bloodpressure)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4640 -1.1949 -0.4078  1.8511  2.6981 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  30.9941    11.9438   2.595  0.03186 * 
## age           0.8614     0.2482   3.470  0.00844 **
## weight        0.3349     0.1307   2.563  0.03351 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.318 on 8 degrees of freedom
## Multiple R-squared:  0.9768, Adjusted R-squared:  0.9711 
## F-statistic: 168.8 on 2 and 8 DF,  p-value: 2.874e-07
# predict blood pressure using bloodpressure_model :prediction
bloodpressure$prediction <- predict(bloodpressure_model)

# plot the results
ggplot(bloodpressure, aes(x=prediction, y=blood_pressure)) + 
    geom_point() +
    geom_abline(color = "blue")


Chapter 2 - Training and Evaluating Regression Models

Evaluating models graphically:

  • Visual inspection is very helpful for assessing model fit - closeness to line, bias of residuals, etc.
    • Residuals vs. prediction graph is especially helpful
  • The Gain curve can be helpful if the goal is to make ordinal predictions
    • The x-axis is the model-sorted order (decreasing predictions)
    • The y-axis is the fraction of total target predicted with the highest x predictions
    • Forms a Gini curve, with the Wizard curve considered to be the perfect model

Root Mean Squared Effort (RMSE):

  • MSE is the mean-squared error for the model - mean of (y-bar - y)**2
    • RMSE is the square root of RMSE, or the average magnitude of the prediction error
  • One way to evaluate RMSE is to compare to the standard deviation of the dependent variable

R-Squared - value between 0 and 1 for quality of model fit:

  • R-squared = 1 - RSS / SST where RSS is the residual sum-squares (variance still in the residuals) and SST is the total sum-squares (total variance in the data)
  • Can get the R-squared for a linear regression model using summary() or broom::glance()

Properly Training a Model:

  • Models tend to perform better on training data than on similar data they have not yet seen
    • Overfits are when the training error is much less than the test (unseen) data
  • Cross-validation is also a valuable tool for predicting the out-of-sample error rate
    • For k-fold CV, split the data in to k groups, then train the model on each possible set of k-1 groups
    • The average error of predicting in to the one hold-out group for each train is the likely test error
    • Helps provide an unbiased estimate of RMSE
  • Cross-validation is mainly a test of the modeling process; always ideal to have a true out-of-sample test set if it can be found
  • If dframe is the training data, then one way to add a column of cross-validation predictions to the frame is as follows:
    • dframe$pred.cv <- 0
    • for(i in 1:k) {
    • split <- splitPlan[[i]]
    • model <- lm(fmla, data = dframe[split$train,])
    • dframe\(pred.cv[split\)app] <- predict(model, newdata = dframe[split$app,])
    • }

Example code includes:

# The data frame unemployment and model unemployment_model are available in the workspace
# unemployment is in the workspace
summary(unemployment)
##  male_unemployment female_unemployment   prediction   
##  Min.   :2.900     Min.   :4.000       Min.   :3.448  
##  1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837  
##  Median :6.000     Median :5.200       Median :5.601  
##  Mean   :5.954     Mean   :5.569       Mean   :5.569  
##  3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087  
##  Max.   :9.800     Max.   :7.900       Max.   :8.240
# unemployment_model is in the workspace
summary(unemployment_model)
## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05
# Make predictions from the model
unemployment$predictions <- predict(unemployment_model)

# Fill in the blanks to plot predictions (on x-axis) versus the female_unemployment rates
ggplot(unemployment, aes(x = predictions, y = female_unemployment)) + 
  geom_point() + 
  geom_abline()

# Calculate residuals
unemployment$residuals <- residuals(unemployment_model)

# Fill in the blanks to plot predictions (on x-axis) versus the residuals
ggplot(unemployment, aes(x = predictions, y = residuals)) + 
  geom_pointrange(aes(ymin = 0, ymax = residuals)) + 
  geom_hline(yintercept = 0, linetype = 3) + 
  ggtitle("residuals vs. linear model prediction")

# In the previous exercise you made predictions about female_unemployment and visualized the predictions and the residuals
# Now, you will also plot the gain curve of the unemployment_model's predictions against actual female_unemployment using the WVPlots::GainCurvePlot() function

# For situations where order is more important than exact values, the gain curve helps you check if the model's predictions sort in the same order as the true outcome.
# Calls to the function GainCurvePlot() look like:
# GainCurvePlot(frame, xvar, truthvar, title)
# frame is a data frame
# xvar and truthvar are strings naming the prediction and actual outcome columns of frame
# title is the title of the plot

# When the predictions sort in exactly the same order, the relative Gini coefficient is 1
# When the model sorts poorly, the relative Gini coefficient is close to zero, or even negative

# The data frame unemployment and the model unemployment_model are in the workspace


# Load the package WVPlots
# library(WVPlots)

# Plot the Gain Curve
WVPlots::GainCurvePlot(unemployment, "predictions", "female_unemployment", "Unemployment model")

# For convenience put the residuals in the variable res
res <- unemployment$residuals

# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res**2)))
## [1] 0.5337612
# Calculate the standard deviation of female_unemployment and print it
(sd_unemployment <- sd(unemployment$female_unemployment))
## [1] 1.314271
# Calculate mean female_unemployment: fe_mean. Print it
(fe_mean <- mean(unemployment$female_unemployment))
## [1] 5.569231
# Calculate total sum of squares: tss. Print it
(tss <- sum((unemployment$female_unemployment - fe_mean)^2))
## [1] 20.72769
# Calculate residual sum of squares: rss. Print it
(rss <- sum(unemployment$residuals ** 2))
## [1] 3.703714
# Calculate R-squared: rsq. Print it. Is it a good fit?
(rsq <- 1 - rss/tss)
## [1] 0.8213157
# Get R-squared from glance. Print it
(rsq_glance <- broom::glance(unemployment_model)$r.squared)
## [1] 0.8213157
# Get the correlation between the prediction and true outcome: rho and print it
(rho <- cor(unemployment$female_unemployment, unemployment$predictions))
## [1] 0.9062647
# Square rho: rho2 and print it
(rho2 <- rho ** 2)
## [1] 0.8213157
# Get R-squared from glance and print it
(rsq_glance <- broom::glance(unemployment_model)$r.squared)
## [1] 0.8213157
# For the next several exercises you will use the mpg data from the package ggplot2
# The data describes the characteristics of several makes and models of cars from different years
# The goal is to predict city fuel efficiency from highway fuel efficiency

# In this exercise, you will split mpg into a training set mpg_train (75% of the data) and a test set mpg_test (25% of the data)
# One way to do this is to generate a column of uniform random numbers between 0 and 1, using the function runif()

# If you have a data set dframe of size NN, and you want a random subset of approximately size 100∗X100∗X% of NN (where XX is between 0 and 1), then
# Generate a vector of uniform random numbers: gp = runif(N).
# dframe[gp < X,] will be about the right size.
# dframe[gp >= X,] will be the complement.

mpg <- readRDS("./RInputFiles/mpg.rds")
summary(mpg)
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00
dim(mpg)
## [1] 234  11
# Use nrow to get the number of rows in mpg (N) and print it
(N <- nrow(mpg))
## [1] 234
# Calculate how many rows 75% of N should be and print it
# Hint: use round() to get an integer
(target <- round(0.75 * N))
## [1] 176
# Create the vector of N uniform random variables: gp
gp <- runif(N)

# Use gp to create the training set: mpg_train (75% of data) and mpg_test (25% of data)
mpg_train <- mpg[gp <= 0.75, ]
mpg_test <- mpg[gp > 0.75, ]

# Use nrow() to examine mpg_train and mpg_test
nrow(mpg_train)
## [1] 190
nrow(mpg_test)
## [1] 44
# mpg_train is in the workspace
summary(mpg_train)
##  manufacturer          model               displ            year     
##  Length:190         Length:190         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :1999  
##                                        Mean   :3.469   Mean   :2003  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :6.500   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:190         Length:190         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.916                                         Mean   :17.01  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:190         Length:190        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :25.00   Mode  :character   Mode  :character  
##  Mean   :23.68                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00
# Create a formula to express cty as a function of hwy: fmla and print it.
(fmla <- cty ~ hwy)
## cty ~ hwy
# Now use lm() to build a model mpg_model from mpg_train that predicts cty from hwy 
mpg_model <- lm(fmla, data=mpg_train)

# Use summary() to examine the model
summary(mpg_model)
## 
## Call:
## lm(formula = fmla, data = mpg_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9064 -0.7184 -0.0030  0.6241  4.0809 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78448    0.35983    2.18   0.0305 *  
## hwy          0.68488    0.01473   46.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.209 on 188 degrees of freedom
## Multiple R-squared:   0.92,  Adjusted R-squared:  0.9195 
## F-statistic:  2161 on 1 and 188 DF,  p-value: < 2.2e-16
rmse <- function(predcol, ycol) {
  res = predcol-ycol
  sqrt(mean(res^2))
}


r_squared <- function(predcol, ycol) {
  tss = sum( (ycol - mean(ycol))^2 )
  rss = sum( (predcol - ycol)^2 )
  1 - rss/tss
}


# predict cty from hwy for the training set
mpg_train$pred <- predict(mpg_model)

# predict cty from hwy for the test set
mpg_test$pred <- predict(mpg_model, newdata=mpg_test)

# Evaluate the rmse on both training and test data and print them
(rmse_train <- rmse(mpg_train$pred, mpg_train$cty))
## [1] 1.202648
(rmse_test <- rmse(mpg_test$pred, mpg_test$cty))
## [1] 1.423936
# Evaluate the r-squared on both training and test data.and print them
(rsq_train <- r_squared(mpg_train$pred, mpg_train$cty))
## [1] 0.9199508
(rsq_test <- r_squared(mpg_test$pred, mpg_test$cty))
## [1] 0.8834909
# Plot the predictions (on the x-axis) against the outcome (cty) on the test data
ggplot(mpg_test, aes(x = pred, y = cty)) + 
  geom_point() + 
  geom_abline()

# There are several ways to implement an n-fold cross validation plan
# In this exercise you will create such a plan using vtreat::kWayCrossValidation(), and examine it

# kWayCrossValidation() creates a cross validation plan with the following call:
# splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y)
# where nRows is the number of rows of data to be split, and nSplits is the desired number of cross-validation folds.

# Strictly speaking, dframe and y aren't used by kWayCrossValidation; they are there for compatibility with other vtreat data partitioning functions
# You can set them both to NULL

# The resulting splitPlan is a list of nSplits elements; each element contains two vectors:
# train: the indices of dframe that will form the training set
# app: the indices of dframe that will form the test (or application) set


# Get the number of rows in mpg
nRows <- nrow(mpg)

# Implement the 3-fold cross-fold plan with vtreat
splitPlan <- vtreat::kWayCrossValidation(nRows, 3, NULL, NULL)

# Examine the split plan
str(splitPlan)
## List of 3
##  $ :List of 2
##   ..$ train: int [1:156] 2 4 5 7 8 9 10 11 12 13 ...
##   ..$ app  : int [1:78] 165 116 205 26 212 158 192 155 135 136 ...
##  $ :List of 2
##   ..$ train: int [1:156] 1 2 3 4 5 6 7 14 15 16 ...
##   ..$ app  : int [1:78] 124 231 196 54 107 222 173 87 13 9 ...
##  $ :List of 2
##   ..$ train: int [1:156] 1 3 6 8 9 10 11 12 13 15 ...
##   ..$ app  : int [1:78] 169 76 110 50 168 109 179 214 119 120 ...
##  - attr(*, "splitmethod")= chr "kwaycross"
# The data frame mpg, the cross validation plan splitPlan, and the function to calculate RMSE (rmse()) from one of the previous exercises is available in your workspace.

# Run the 3-fold cross validation plan from splitPlan
k <- 3 # Number of folds
mpg$pred.cv <- 0 
for(i in 1:k) {
  split <- splitPlan[[i]]
  model <- lm(cty ~ hwy, data = mpg[split$train, ])
  mpg$pred.cv[split$app] <- predict(model, newdata = mpg[split$app, ])
}

# Predict from a full model
mpg$pred <- predict(lm(cty ~ hwy, data = mpg))

# Get the rmse of the full model's predictions
rmse(mpg$pred, mpg$cty)
## [1] 1.247045
# Get the rmse of the cross-validation predictions
rmse(mpg$pred.cv, mpg$cty)
## [1] 1.264219

Chapter 3 - Issues to Consider

Categorical Inputs:

  • R converts categorical variables with N levels in to N-1 dummy (0/1) variables
    • The level not created for the model is the default value and is represented by the intercept
  • Too many levels can be a problem - computation issues, intepretation issues, over-fits, etc.
  • Generally, do not hash a factor variable in to a numeric variable, since High =3, Med = 2, Low = 1 is likely artificaly (e.g., High is unlikely to be 3x Low in response or intercept or the like)

Interactions:

  • Accounting for variables that do not contribute solely in a linear manner
  • There are several symbols for adding variable interactions
    • a:b means solely the interaction of a and b
    • a*b is the full interaction and main effects of a and b, meaning a + b + a:b
    • I(ab) means multiply ab and treat that as its own variable - do not make it an interaction
  • Cross-validation can help to understand the prediction quality using varying types of interactions

Transforming Response before Modeling:

  • The log-transform is a common transformation for managing long tails in the response variable
    • The log helps to make a more normal distribution, and better aligns the mean with the median
  • Generally, a three-step process for running a model with a transformed response variable
    • Log the outcome and fit a model - myLM <- lm(log(y) ~ x, data=)
    • Make predictions in the log space - myLogPred <- predict(myLM, newdata=)
    • Exponentiate the predictions to get back to the main scale - myPred <- exp(myLogPred)
  • Log transformations reduce magnitude of relative error [ 1 - y-hat/y ] rather than magnitude of absolute error [ y-hat - y ]
    • RMS-relative-error is sqrt( mean ( (pred - y) / y ) )

Transforming Inputs Before Modeling:

  • Domain knowledge frequently drives the decision to make a transformation to an input variable
  • Transformations are also sometimes needed in order to maintain consistency with modeling assumptions
    • The I() command in a formula means “treat the enclosed expression literally”
    • I(x**2) would mean x-squared

Example code includes:

# For this exercise you will call model.matrix() to examine how R represents data with both categorical and numerical inputs for modeling
# The dataset flowers (derived from the Sleuth3 package) is loaded into your workspace
# It has the following columns:
# Flowers: the average number of flowers on a meadowfoam plant
# Intensity: the intensity of a light treatment applied to the plant
# Time: A categorical variable - when (Late or Early) in the lifecycle the light treatment occurred
# The ultimate goal is to predict Flowers as a function of Time and Intensity

# The data frame flowers is in your workspace.

flowers <- data.frame(Flowers=c(62.3, 77.4, 55.3, 54.2, 49.6, 61.9, 39.4, 45.7, 31.3, 44.9, 36.8, 41.9, 77.8, 75.6, 69.1, 78, 57, 71.1, 62.9, 52.2, 60.3, 45.6, 52.6, 44.4), 
                      Intensity=c(150, 150, 300, 300, 450, 450, 600, 600, 750, 750, 900, 900, 150, 150, 300, 300, 450, 450, 600, 600, 750, 750, 900, 900), 
                      Time=c('Late', 'Late', 'Late', 'Late', 'Late', 'Late', 'Late', 'Late', 'Late', 'Late', 'Late', 'Late', 'Early', 'Early', 'Early', 'Early', 'Early', 'Early', 'Early', 'Early', 'Early', 'Early', 'Early', 'Early'),
                      stringsAsFactors=FALSE
                      )

# Call str on flowers to see the types of each column
str(flowers)
## 'data.frame':    24 obs. of  3 variables:
##  $ Flowers  : num  62.3 77.4 55.3 54.2 49.6 61.9 39.4 45.7 31.3 44.9 ...
##  $ Intensity: num  150 150 300 300 450 450 600 600 750 750 ...
##  $ Time     : chr  "Late" "Late" "Late" "Late" ...
# Use unique() to see how many possible values Time takes
unique(flowers$Time)
## [1] "Late"  "Early"
# Build a formula to express Flowers as a function of Intensity and Time: fmla. Print it
(fmla <- as.formula("Flowers ~ Intensity + Time"))
## Flowers ~ Intensity + Time
# Use fmla and model.matrix to see how the data is represented for modeling
mmat <- model.matrix(fmla, data=flowers)

# Examine the first 20 lines of flowers
head(flowers, n=20)
##    Flowers Intensity  Time
## 1     62.3       150  Late
## 2     77.4       150  Late
## 3     55.3       300  Late
## 4     54.2       300  Late
## 5     49.6       450  Late
## 6     61.9       450  Late
## 7     39.4       600  Late
## 8     45.7       600  Late
## 9     31.3       750  Late
## 10    44.9       750  Late
## 11    36.8       900  Late
## 12    41.9       900  Late
## 13    77.8       150 Early
## 14    75.6       150 Early
## 15    69.1       300 Early
## 16    78.0       300 Early
## 17    57.0       450 Early
## 18    71.1       450 Early
## 19    62.9       600 Early
## 20    52.2       600 Early
# Examine the first 20 lines of mmat
head(mmat, n=20)
##    (Intercept) Intensity TimeLate
## 1            1       150        1
## 2            1       150        1
## 3            1       300        1
## 4            1       300        1
## 5            1       450        1
## 6            1       450        1
## 7            1       600        1
## 8            1       600        1
## 9            1       750        1
## 10           1       750        1
## 11           1       900        1
## 12           1       900        1
## 13           1       150        0
## 14           1       150        0
## 15           1       300        0
## 16           1       300        0
## 17           1       450        0
## 18           1       450        0
## 19           1       600        0
## 20           1       600        0
# Fit a model to predict Flowers from Intensity and Time : flower_model
flower_model <- lm(fmla, data=flowers)

# Use summary on mmat to remind yourself of its structure
summary(mmat)
##   (Intercept)   Intensity      TimeLate  
##  Min.   :1    Min.   :150   Min.   :0.0  
##  1st Qu.:1    1st Qu.:300   1st Qu.:0.0  
##  Median :1    Median :525   Median :0.5  
##  Mean   :1    Mean   :525   Mean   :0.5  
##  3rd Qu.:1    3rd Qu.:750   3rd Qu.:1.0  
##  Max.   :1    Max.   :900   Max.   :1.0
# Use summary to examine flower_model 
summary(flower_model)
## 
## Call:
## lm(formula = fmla, data = flowers)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.652 -4.139 -1.558  5.632 12.165 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  83.464167   3.273772  25.495  < 2e-16 ***
## Intensity    -0.040471   0.005132  -7.886 1.04e-07 ***
## TimeLate    -12.158333   2.629557  -4.624 0.000146 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.441 on 21 degrees of freedom
## Multiple R-squared:  0.7992, Adjusted R-squared:   0.78 
## F-statistic: 41.78 on 2 and 21 DF,  p-value: 4.786e-08
# Predict the number of flowers on each plant
flowers$predictions <- predict(flower_model)

# Plot predictions vs actual flowers (predictions on x-axis)
ggplot(flowers, aes(x = predictions, y = Flowers)) + 
  geom_point() +
  geom_abline(color = "blue") 

# The data frame alcohol is in your workspace.
# alcohol is in the workspace
alcohol <- data.frame(Subject=1:32, 
                      Metabol=c(0.6, 0.6, 1.5, 0.4, 0.1, 0.2, 0.3, 0.3, 0.4, 1, 1.1, 1.2, 1.3, 1.6, 1.8, 2, 2.5, 2.9, 1.5, 1.9, 2.7, 3, 3.7, 0.3, 2.5, 2.7, 3, 4, 4.5, 6.1, 9.5, 12.3), 
                      Gastric=c(1, 1.6, 1.5, 2.2, 1.1, 1.2, 0.9, 0.8, 1.5, 0.9, 1.6, 1.7, 1.7, 2.2, 0.8, 2, 3, 2.2, 1.3, 1.2, 1.4, 1.3, 2.7, 1.1, 2.3, 2.7, 1.4, 2.2, 2, 2.8, 5.2, 4.1), 
                      Sex=c('Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male'), 
                      Alcohol=c('Alcoholic', 'Alcoholic', 'Alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Alcoholic', 'Alcoholic', 'Alcoholic', 'Alcoholic', 'Alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic'), 
                      stringsAsFactors = TRUE
                      )


summary(alcohol)
##     Subject         Metabol          Gastric          Sex    
##  Min.   : 1.00   Min.   : 0.100   Min.   :0.800   Female:18  
##  1st Qu.: 8.75   1st Qu.: 0.600   1st Qu.:1.200   Male  :14  
##  Median :16.50   Median : 1.700   Median :1.600              
##  Mean   :16.50   Mean   : 2.422   Mean   :1.863              
##  3rd Qu.:24.25   3rd Qu.: 2.925   3rd Qu.:2.200              
##  Max.   :32.00   Max.   :12.300   Max.   :5.200              
##           Alcohol  
##  Alcoholic    : 8  
##  Non-alcoholic:24  
##                    
##                    
##                    
## 
# Create the formula with main effects only
(fmla_add <- Metabol ~ Gastric + Sex)
## Metabol ~ Gastric + Sex
# Create the formula with interactions
(fmla_interaction <- Metabol ~ Gastric + Gastric:Sex )
## Metabol ~ Gastric + Gastric:Sex
# Fit the main effects only model
model_add <- lm(fmla_add, data=alcohol)

# Fit the interaction model
model_interaction <- lm(fmla_interaction, data=alcohol)

# Call summary on both models and compare
summary(model_add)
## 
## Call:
## lm(formula = fmla_add, data = alcohol)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2779 -0.6328 -0.0966  0.5783  4.5703 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.9466     0.5198  -3.745 0.000796 ***
## Gastric       1.9656     0.2674   7.352 4.24e-08 ***
## SexMale       1.6174     0.5114   3.163 0.003649 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.331 on 29 degrees of freedom
## Multiple R-squared:  0.7654, Adjusted R-squared:  0.7492 
## F-statistic: 47.31 on 2 and 29 DF,  p-value: 7.41e-10
summary(model_interaction)
## 
## Call:
## lm(formula = fmla_interaction, data = alcohol)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4656 -0.5091  0.0143  0.5660  4.0668 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.7504     0.5310  -1.413 0.168236    
## Gastric           1.1489     0.3450   3.331 0.002372 ** 
## Gastric:SexMale   1.0422     0.2412   4.321 0.000166 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.204 on 29 degrees of freedom
## Multiple R-squared:  0.8081, Adjusted R-squared:  0.7948 
## F-statistic: 61.05 on 2 and 29 DF,  p-value: 4.033e-11
# Create the splitting plan for 3-fold cross validation
set.seed(34245)  # set the seed for reproducibility
splitPlan <- vtreat::kWayCrossValidation(nrow(alcohol), 3, NULL, NULL)

# Sample code: Get cross-val predictions for main-effects only model
alcohol$pred_add <- 0  # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_add <- lm(fmla_add, data = alcohol[split$train, ])
  alcohol$pred_add[split$app] <- predict(model_add, newdata = alcohol[split$app, ])
}

# Get the cross-val predictions for the model with interactions
alcohol$pred_interaction <- 0 # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_interaction <- lm(fmla_interaction, data = alcohol[split$train, ])
  alcohol$pred_interaction[split$app] <- predict(model_interaction, newdata = alcohol[split$app, ])
}

# Get RMSE
alcohol %>% 
  tidyr::gather(key = modeltype, value = pred, pred_add, pred_interaction) %>%
  mutate(residuals = Metabol-pred) %>%      
  group_by(modeltype) %>%
  summarize(rmse = sqrt(mean(residuals^2)))
## # A tibble: 2 x 2
##   modeltype         rmse
##   <chr>            <dbl>
## 1 pred_add          1.64
## 2 pred_interaction  1.38
# The example (toy) dataset fdata is loaded in your workspace. It includes the columns:
# y: the true output to be predicted by some model; imagine it is the amount of money a customer will spend on a visit to your store.
# pred: the predictions of a model that predicts y.
# label: categorical: whether y comes from a population that makes small purchases, or large ones.

fdata <- data.frame(y=c(9.15, 1.9, -3.86, 2.39, 1.54, 13.56, 10.2, 1.1, 3.94, 9.04, 1.73, 15.72, 2.26, -1.98, 1.1, 18.63, 3.68, 3.09, 8.69, 7.91, 5.44, 14.79, 9.02, 3.98, 2.67, 7.68, 11.93, 5.31, 13.06, 2.23, 15.4, -0.88, 7.61, 9.86, 4.36, 3.84, 11.34, 17.13, 16.17, -5.89, 12.64, 6.45, 2.97, 4.08, 5.52, 4.83, 6.72, 1.84, 3.2, 10.82, 1026.4, 202.39, 833.35, 1075.41, 96.12, 438.24, 911.33, 542.56, 686.33, 494.47, 422.81, 1033.88, 161.99, 491.43, 575.92, 384.77, 720.03, 963.94, 159.79, 765.4, 246.42, 1097.92, 1050.4, 1069.62, 116.81, 523.52, 457.74, 1060.56, 761.92, 969.32, 522.68, 475.87, 368.54, 1101.62, 1052.92, 663.04, 136.69, 331.81, 921.96, 773.58, 458.33, 643.47, 738.65, 846.42, 413.66, 180.99, 695.62, 164.9, 106.51, 358.36), 
                    pred=c(6.43, 3.47, 1.59, 3.76, 9.51, 6.93, 8.19, 1.51, 8.99, 6.15, 8.5, 10.94, 6, 1.07, 4.42, 10.52, 5.75, 7.09, 7.84, 4.31, 6, 8.31, 8.63, 2.98, 4.04, 7.46, 9.08, 6.52, 10.34, 4.24, 8.83, 1.17, 4.74, 8.43, 10.19, 4.33, 6.39, 11, 7.52, 2.45, 9.51, 5.21, 3.97, 6.32, 10.17, 3.85, 6.3, 3.65, 2.52, 8.02, 1027.19, 194.52, 826.25, 1081.44, 100.39, 430.29, 912.57, 533.95, 691.79, 498.01, 423.2, 1032.66, 168.55, 492.24, 589.71, 377.15, 730.11, 967.65, 159.21, 767.19, 250.13, 1098.9, 1048.87, 1057.99, 119.28, 524.56, 459.77, 1053.65, 751.11, 966.38, 520.61, 467.72, 364.21, 1097.07, 1054.63, 664.37, 137.49, 326.09, 929.97, 772.13, 456.43, 646.67, 747, 842.53, 411.89, 175.77, 687.86, 165.81, 108.48, 363.87), 
                    label=c(rep("small purchases", 50), rep("large purchases", 50)), 
                    stringsAsFactors=TRUE
                    )


# fdata is in the workspace
summary(fdata)
##        y                 pred                      label   
##  Min.   :  -5.890   Min.   :   1.070   large purchases:50  
##  1st Qu.:   5.407   1st Qu.:   6.372   small purchases:50  
##  Median :  57.375   Median :  55.695                       
##  Mean   : 306.203   Mean   : 305.904                       
##  3rd Qu.: 550.900   3rd Qu.: 547.890                       
##  Max.   :1101.620   Max.   :1098.900
# Examine the data: generate the summaries for the groups large and small:
fdata %>% 
    group_by(label) %>%     # group by small/large purchases
    summarize(min  = min(y),   # min of y
              mean = mean(y),   # mean of y
              max  = max(y))   # max of y
## # A tibble: 2 x 4
##   label              min   mean    max
##   <fct>            <dbl>  <dbl>  <dbl>
## 1 large purchases  96.1  606    1102  
## 2 small purchases - 5.89   6.48   18.6
# Fill in the blanks to add error columns
fdata2 <- fdata %>% 
         group_by(label) %>%       # group by label
           mutate(residual = pred - y,  # Residual
                  relerr   = residual / y)  # Relative error

# Compare the rmse and rmse.rel of the large and small groups:
fdata2 %>% 
  group_by(label) %>% 
  summarize(rmse     = sqrt(mean(residual ** 2)),   # RMSE
            rmse.rel = sqrt(mean(relerr ** 2)))   # Root mean squared relative error
## # A tibble: 2 x 3
##   label            rmse rmse.rel
##   <fct>           <dbl>    <dbl>
## 1 large purchases  5.54   0.0147
## 2 small purchases  4.02   1.25
# Plot the predictions for both groups of purchases
ggplot(fdata2, aes(x = pred, y = y, color = label)) + 
  geom_point() + 
  geom_abline() + 
  facet_wrap(~ label, ncol = 1, scales = "free") + 
  ggtitle("Outcome vs prediction")

# In this exercise, you will practice modeling on log-transformed monetary output, and then transforming the "log-money" predictions back into monetary units
# The data loaded into your workspace records subjects' incomes in 2005 (Income2005), as well as the results of several aptitude tests taken by the subjects in 1981:
# Arith
# Word
# Parag
# Math
# AFQT (Percentile on the Armed Forces Qualifying Test)

# The data have already been split into training and test sets (income_train and income_test respectively) and are in the workspace
# You will build a model of log(income) from the inputs, and then convert log(income) back into income.

# Examine Income2005 in the training set (do not have data)
# summary(income_train$Income2005)

# Write the formula for log income as a function of the tests and print it
# (fmla.log <- log(Income2005) ~ Arith + Word + Parag + Math + AFQT)

# Fit the linear model
# model.log <-  lm(fmla.log, data=income_train)

# Make predictions on income_test
# income_test$logpred <- predict(model.log, newdata=income_test)
# summary(income_test$logpred)

# Convert the predictions to monetary units
# income_test$pred.income <- exp(income_test$logpred)
# summary(income_test$pred.income)

#  Plot predicted income (x axis) vs income
# ggplot(income_test, aes(x = pred.income, y = Income2005)) + 
#   geom_point() + 
#   geom_abline(color = "blue")


# The income_train and income_test datasets are loaded in your workspace, along with your model, model.log
# Also in the workspace:
# model.abs: a model that directly fits income to the inputs using the formula Income2005 ~ Arith + Word + Parag + Math + AFQT

# fmla.abs is in the workspace
# fmla.abs

# model.abs is in the workspace
# summary(model.abs)

# Add predictions to the test set
# income_test <- income_test %>%
#   mutate(pred.absmodel = predict(model.abs, income_test),        # predictions from model.abs
#          pred.logmodel = exp(predict(model.log, income_test)))   # predictions from model.log

# Gather the predictions and calculate residuals and relative error
# income_long <- income_test %>% 
#   gather(key = modeltype, value = pred, pred.absmodel, pred.logmodel) %>%
#   mutate(residual = pred - Income2005,   # residuals
#          relerr   = residual / Income2005)   # relative error

# Calculate RMSE and relative RMSE and compare
# income_long %>% 
#   group_by(modeltype) %>%      # group by modeltype
#   summarize(rmse     = sqrt(mean(residual**2)),    # RMSE
#             rmse.rel = sqrt(mean(relerr**2)))    # Root mean squared relative error


# In this exercise, we will build a model to predict price from a measure of the house's size (surface area)
# The data set houseprice has the columns:
# price : house price in units of $1000
# size: surface area

# A scatterplot of the data shows that the data is quite non-linear: a sort of "hockey-stick" where price is fairly flat for smaller houses, but rises steeply as the house gets larger
# Quadratics and tritics are often good functional forms to express hockey-stick like relationships
# Note that there may not be a "physical" reason that price is related to the square of the size; a quadratic is simply a closed form approximation of the observed relationship

# You will fit a model to predict price as a function of the squared size, and look at its fit on the training data

# Because ^ is also a symbol to express interactions, use the function I() to treat the expression x^2 “as is”: that is, as the square of x rather than the interaction of x with itself
# exampleFormula = y ~ I(x^2)

houseprice <- data.frame(size=c(72, 98, 92, 90, 44, 46, 90, 150, 94, 90, 90, 66, 142, 74, 86, 46, 54, 130, 122, 118, 100, 74, 146, 92, 100, 140, 94, 90, 120, 70, 100, 132, 58, 92, 76, 90, 66, 134, 140, 64), 
                         price=c(156, 153, 230, 152, 42, 157, 113, 573, 202, 261, 175, 212, 486, 109, 220, 186, 133, 360, 283, 380, 185, 186, 459, 167, 171, 547, 170, 286, 293, 109, 205, 514, 175, 249, 234, 242, 177, 399, 511, 107)
                         )

# The data set houseprice is in the workspace.
# houseprice is in the workspace
summary(houseprice)
##       size           price      
##  Min.   : 44.0   Min.   : 42.0  
##  1st Qu.: 73.5   1st Qu.:164.5  
##  Median : 91.0   Median :203.5  
##  Mean   : 94.3   Mean   :249.2  
##  3rd Qu.:118.5   3rd Qu.:287.8  
##  Max.   :150.0   Max.   :573.0
# Create the formula for price as a function of squared size
(fmla_sqr <- price ~ I(size**2))
## price ~ I(size^2)
# Fit a model of price as a function of squared size (use fmla_sqr)
model_sqr <- lm(fmla_sqr, data=houseprice)

# Fit a model of price as a linear function of size
model_lin <- lm(price ~ size, data=houseprice)

# Make predictions and compare
houseprice %>% 
    mutate(pred_lin = predict(model_lin),       # predictions from linear model
           pred_sqr = predict(model_sqr)) %>%   # predictions from quadratic model 
    tidyr::gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>% # gather the predictions
    ggplot(aes(x = size)) + 
       geom_point(aes(y = price)) +                   # actual prices
       geom_line(aes(y = pred, color = modeltype)) + # the predictions
       scale_color_brewer(palette = "Dark2")

# Create a splitting plan for 3-fold cross validation
set.seed(34245)  # set the seed for reproducibility
splitPlan <- vtreat::kWayCrossValidation(nrow(houseprice), 3, NULL, NULL)

# Sample code: get cross-val predictions for price ~ size
houseprice$pred_lin <- 0  # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_lin <- lm(price ~ size, data = houseprice[split$train,])
  houseprice$pred_lin[split$app] <- predict(model_lin, newdata = houseprice[split$app,])
}

# Get cross-val predictions for price as a function of size^2 (use fmla_sqr)
houseprice$pred_sqr <- 0 # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_sqr <- lm(fmla_sqr, data = houseprice[split$train, ])
  houseprice$pred_sqr[split$app] <- predict(model_sqr, newdata = houseprice[split$app, ])
}

# Gather the predictions and calculate the residuals
houseprice_long <- houseprice %>%
  tidyr::gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>%
  mutate(residuals = pred - price)

# Compare the cross-validated RMSE for the two models
houseprice_long %>% 
  group_by(modeltype) %>% # group by modeltype
  summarize(rmse = sqrt(mean(residuals**2)))
## # A tibble: 2 x 2
##   modeltype  rmse
##   <chr>     <dbl>
## 1 pred_lin   74.3
## 2 pred_sqr   63.7

Chapter 4 - Dealing with Non-Linear Responses

Logistic regression for predicting probabilities:

  • While predicting whether an event occurs is classification, predicting the probability that an event occurs is regression
    • The probabilities should be bounded between 0 and 1
  • The logistics regression is a counterpart to the standard linear regression
    • glm(formula, data, family=binomial)
    • Assumption that log(p / (1-p)) is linear in response to the predictor variables
    • log(p / (1-p)) is the “log-odds” model
    • Since the model is going to return probabilities, generally it is best to have input data where 1 means the thing you are interested in
  • Can get the predictions from a logistic regression similar to standard linear regression
    • predict(model, newdata, type=“response”) # The type=“response” is so that the model does not return its log-odds
  • Evaluating a logistic regression is frequently done by way of pseudo-R-squared
    • Deviance - analogous to RSS (goal is to minimize)
    • Null Deviance - analogous to TSS
    • pseudo-R-sqyared - 1 - Deviance / Null Deviance
  • The Gain curve can be helpful for assessing the performance of a logistic regression

Poisson and quasipoisson regression to predict counts:

  • Counts are restricted to being non-negative integers
    • glm(formula, data, family=) # The family can be “poisson” or “quasipoisson”
    • Since Poisson assumes mean and variance are equal, use quasipoisson if this is known to not be true
    • The model will return an expected number (count), but this expectation need not be an integer
    • These regressions work better on larger data sets, and may not be needed if the counts are all very large (where regular regression might work)
  • The pseudo-R-squared is again 1 - Deviance / Null Deviance
    • The predict function again needs to have type=response so that the actual prediction is returned

GAM to learn non-linear transformations:

  • The GAM is the Generalized Additive Model
    • Output depends on unknow smooth functions of the independent variables
    • The expression is gam(formula, family, data) # family can be gaussian (regular regression), binomial (probability), poisson, quasipoisson
    • The GAM has a tendency towards over-fitting, so it is frequently better with larger data sets
  • The s() function in the GAM means the variable should be non-linear - use with continuous variables (more than 10 unique values)
    • The s() function is fitting a spline between the input and the output
    • If a variable is not enclosed in the s() function, that signals that R should assume a linear relationship as to that specific variable
    • The plot(model) will give you the non-linear transformations from the model
    • The predict(model, type=“terms”) will give you the y-terms
    • The predict(model, newdata, type=“response”) will give you the predicted values
  • The GAM is often better than the linear model, but not as good as knowing the actual model transformation (e.g., if you know it is cubic, y ~ I(x**3) will typically perform better)

Example code includes:

# In this exercise, you will estimate the probability that a sparrow survives a severe winter storm, based on physical characteristics of the sparrow
# The dataset sparrow is loaded into your workspace
# The outcome to be predicted is status ("Survived", "Perished")
# The variables we will consider are:
# total_length: length of the bird from tip of beak to tip of tail (mm)
# weight: in grams
# humerus : length of humerus ("upper arm bone" that connects the wing to the body) (inches)

# Remember that when using glm() to create a logistic regression model, you must explicitly specify that family = binomial: glm(formula, data = data, family = binomial)

# The data frame sparrow and the package broom are loaded in the workspace.

sparrow <- readRDS("./RInputFiles/sparrow.rds")

# sparrow is in the workspace
summary(sparrow)
##       status       age             total_length      wingspan    
##  Perished:36   Length:87          Min.   :153.0   Min.   :236.0  
##  Survived:51   Class :character   1st Qu.:158.0   1st Qu.:245.0  
##                Mode  :character   Median :160.0   Median :247.0  
##                                   Mean   :160.4   Mean   :247.5  
##                                   3rd Qu.:162.5   3rd Qu.:251.0  
##                                   Max.   :167.0   Max.   :256.0  
##      weight       beak_head        humerus           femur       
##  Min.   :23.2   Min.   :29.80   Min.   :0.6600   Min.   :0.6500  
##  1st Qu.:24.7   1st Qu.:31.40   1st Qu.:0.7250   1st Qu.:0.7000  
##  Median :25.8   Median :31.70   Median :0.7400   Median :0.7100  
##  Mean   :25.8   Mean   :31.64   Mean   :0.7353   Mean   :0.7134  
##  3rd Qu.:26.7   3rd Qu.:32.10   3rd Qu.:0.7500   3rd Qu.:0.7300  
##  Max.   :31.0   Max.   :33.00   Max.   :0.7800   Max.   :0.7600  
##     legbone          skull           sternum      
##  Min.   :1.010   Min.   :0.5600   Min.   :0.7700  
##  1st Qu.:1.110   1st Qu.:0.5900   1st Qu.:0.8300  
##  Median :1.130   Median :0.6000   Median :0.8500  
##  Mean   :1.131   Mean   :0.6032   Mean   :0.8511  
##  3rd Qu.:1.160   3rd Qu.:0.6100   3rd Qu.:0.8800  
##  Max.   :1.230   Max.   :0.6400   Max.   :0.9300
# Create the survived column
sparrow$survived <- sparrow$status == "Survived"

# Create the formula
(fmla <- survived ~ total_length + weight + humerus)
## survived ~ total_length + weight + humerus
# Fit the logistic regression model
sparrow_model <- glm(fmla, data=sparrow, family="binomial")

# Call summary
summary(sparrow_model)
## 
## Call:
## glm(formula = fmla, family = "binomial", data = sparrow)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1117  -0.6026   0.2871   0.6577   1.7082  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   46.8813    16.9631   2.764 0.005715 ** 
## total_length  -0.5435     0.1409  -3.858 0.000115 ***
## weight        -0.5689     0.2771  -2.053 0.040060 *  
## humerus       75.4610    19.1586   3.939 8.19e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 118.008  on 86  degrees of freedom
## Residual deviance:  75.094  on 83  degrees of freedom
## AIC: 83.094
## 
## Number of Fisher Scoring iterations: 5
# Call glance
(perf <- broom::glance(sparrow_model))
##   null.deviance df.null    logLik      AIC      BIC deviance df.residual
## 1      118.0084      86 -37.54718 83.09436 92.95799 75.09436          83
# Calculate pseudo-R-squared
(pseudoR2 <- 1 - perf$deviance / perf$null.deviance)
## [1] 0.3636526
# Recall that when calling predict() to get the predicted probabilities from a glm() model, you must specify that you want the response:
# predict(model, type = "response")
# Otherwise, predict() on a logistic regression model returns the predicted log-odds of the event, not the probability.

# You will also use the GainCurvePlot() function to plot the gain curve from the model predictions
# GainCurvePlot(frame, xvar, truthVar, title)

# Make predictions
sparrow$pred <- predict(sparrow_model, type="response")

# Look at gain curve
WVPlots::GainCurvePlot(sparrow, "pred", "survived", "sparrow survival model")

# In this exercise you will build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day
# You will train the model on data from the month of July

# Remember that you must specify family = poisson or family = quasipoisson when using glm() to fit a count model
# Since there are a lot of input variables, for convenience we will specify the outcome and the inputs in variables, and use paste() to assemble a string representing the model formula.

# The data frame bikesJuly is in the workspace
# The names of the outcome variable and the input variables are also in the workspace as the variables outcome and vars respectively

load("./RInputFiles/Bikes.RData")

outcome <- "cnt"
vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed")

# bikesJuly is in the workspace
str(bikesJuly)
## 'data.frame':    744 obs. of  12 variables:
##  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
##  $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
##  $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
##  $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
##  $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
##  $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
##  $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
## [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
# Calculate the mean and variance of the outcome
(mean_bikes <- mean(bikesJuly$cnt))
## [1] 273.6653
(var_bikes <- var(bikesJuly$cnt))
## [1] 45863.84
# Fit the model
bike_model <- glm(fmla, data=bikesJuly, family=quasipoisson)

# Call glance
(perf <- broom::glance(bike_model))
##   null.deviance df.null logLik AIC BIC deviance df.residual
## 1      133364.9     743     NA  NA  NA  28774.9         712
# Calculate pseudo-R-squared
(pseudoR2 <- 1 - perf$deviance / perf$null.deviance)
## [1] 0.7842393
# In this exercise you will use the model you built in the previous exercise to make predictions for the month of August
# The data set bikesAugust has the same columns as bikesJuly

# Recall that you must specify type = "response" with predict() when predicting counts from a glm poisson or quasipoisson model

# bikesAugust is in the workspace
str(bikesAugust)
## 'data.frame':    744 obs. of  12 variables:
##  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
##  $ atemp     : num  0.636 0.606 0.576 0.576 0.591 ...
##  $ hum       : num  0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
##  $ windspeed : num  0.1642 0.0896 0.1045 0.1045 0.1343 ...
##  $ cnt       : int  47 33 13 7 4 49 185 487 681 350 ...
##  $ instant   : int  13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
##  $ mnth      : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
# bike_model is in the workspace
summary(bike_model)
## 
## Call:
## glm(formula = fmla, family = quasipoisson, data = bikesJuly)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -21.6117   -4.3121   -0.7223    3.5507   16.5079  
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    5.934986   0.439027  13.519  < 2e-16 ***
## hr1                           -0.580055   0.193354  -3.000 0.002794 ** 
## hr2                           -0.892314   0.215452  -4.142 3.86e-05 ***
## hr3                           -1.662342   0.290658  -5.719 1.58e-08 ***
## hr4                           -2.350204   0.393560  -5.972 3.71e-09 ***
## hr5                           -1.084289   0.230130  -4.712 2.96e-06 ***
## hr6                            0.211945   0.156476   1.354 0.176012    
## hr7                            1.211135   0.132332   9.152  < 2e-16 ***
## hr8                            1.648361   0.127177  12.961  < 2e-16 ***
## hr9                            1.155669   0.133927   8.629  < 2e-16 ***
## hr10                           0.993913   0.137096   7.250 1.09e-12 ***
## hr11                           1.116547   0.136300   8.192 1.19e-15 ***
## hr12                           1.282685   0.134769   9.518  < 2e-16 ***
## hr13                           1.273010   0.135872   9.369  < 2e-16 ***
## hr14                           1.237721   0.136386   9.075  < 2e-16 ***
## hr15                           1.260647   0.136144   9.260  < 2e-16 ***
## hr16                           1.515893   0.132727  11.421  < 2e-16 ***
## hr17                           1.948404   0.128080  15.212  < 2e-16 ***
## hr18                           1.893915   0.127812  14.818  < 2e-16 ***
## hr19                           1.669277   0.128471  12.993  < 2e-16 ***
## hr20                           1.420732   0.131004  10.845  < 2e-16 ***
## hr21                           1.146763   0.134042   8.555  < 2e-16 ***
## hr22                           0.856182   0.138982   6.160 1.21e-09 ***
## hr23                           0.479197   0.148051   3.237 0.001265 ** 
## holidayTRUE                    0.201598   0.079039   2.551 0.010961 *  
## workingdayTRUE                 0.116798   0.033510   3.485 0.000521 ***
## weathersitLight Precipitation -0.214801   0.072699  -2.955 0.003233 ** 
## weathersitMisty               -0.010757   0.038600  -0.279 0.780572    
## temp                          -3.246001   1.148270  -2.827 0.004833 ** 
## atemp                          2.042314   0.953772   2.141 0.032589 *  
## hum                           -0.748557   0.236015  -3.172 0.001581 ** 
## windspeed                      0.003277   0.148814   0.022 0.982439    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 38.98949)
## 
##     Null deviance: 133365  on 743  degrees of freedom
## Residual deviance:  28775  on 712  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5
# Make predictions on August data
bikesAugust$pred  <- predict(bike_model, newdata=bikesAugust, type="response")

# Calculate the RMSE
bikesAugust %>% 
  mutate(residual = pred - cnt) %>%
  summarize(rmse  = sqrt(mean(residual**2)))
##       rmse
## 1 112.5815
# Plot predictions vs cnt (pred on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) +
  geom_point() + 
  geom_abline(color = "darkblue")

# In the previous exercise, you visualized the bike model's predictions using the standard "outcome vs. prediction" scatter plot
# Since the bike rental data is time series data, you might be interested in how the model performs as a function of time
# In this exercise, you will compare the predictions and actual rentals on an hourly basis, for the first 14 days of August

# To create the plot you will use the function tidyr::gather() to consolidate the predicted and actual values from bikesAugust in a single column
# gather() takes as arguments:
# The "wide" data frame to be gathered (implicit in a pipe)
# The name of the key column to be created - contains the names of the gathered columns.
# The name of the value column to be created - contains the values of the gathered columns.
# The names of the columns to be gathered into a single column.

# You'll use the gathered data frame to compare the actual and predicted rental counts as a function of time
# The time index, instant counts the number of observations since the beginning of data collection
# The sample code converts the instants to daily units, starting from 0

# The data frame bikesAugust, with the predictions (bikesAugust$pred) is in the workspace.

# Plot predictions and cnt by date/time
quasipoisson_plot <- bikesAugust %>% 
  # set start to 0, convert unit to days
  mutate(instant = (instant - min(instant))/24) %>%  
  # gather cnt and pred into a value column
  tidyr::gather(key = valuetype, value = value, cnt, pred) %>%
  filter(instant < 14) %>% # restric to first 14 days
  # plot value by instant
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Quasipoisson model")

quasipoisson_plot

# In this exercise you will model the average leaf weight on a soybean plant as a function of time (after planting)
# As you will see, the soybean plant doesn't grow at a steady rate, but rather has a "growth spurt" that eventually tapers off
# Hence, leaf weight is not well described by a linear model.

# Recall that you can designate which variable you want to model non-linearly in a formula with the s() function:

# Also remember that gam() from the package mgcv has the calling interface gam(formula, family, data)
# For standard regression, use family = gaussian (the default).

# The soybean training data, soybean_train is loaded into your workspace
# It has two columns: the outcome weight and the variable Time
# For comparison, the linear model model.lin, which was fit using the formula weight ~ Time has already been loaded into the workspace as well

load("./RInputFiles/Soybean.RData")

# soybean_train is in the workspace
summary(soybean_train)
##       Plot     Variety   Year          Time           weight       
##  1988F6 : 10   F:161   1988:124   Min.   :14.00   Min.   : 0.0290  
##  1988F7 :  9   P:169   1989:102   1st Qu.:27.00   1st Qu.: 0.6663  
##  1988P1 :  9           1990:104   Median :42.00   Median : 3.5233  
##  1988P8 :  9                      Mean   :43.56   Mean   : 6.1645  
##  1988P2 :  9                      3rd Qu.:56.00   3rd Qu.:10.3808  
##  1988F3 :  8                      Max.   :84.00   Max.   :27.3700  
##  (Other):276
# Plot weight vs Time (Time on x axis)
ggplot(soybean_train, aes(x = Time, y = weight)) + 
  geom_point()

# Create the formula 
(fmla.gam <- weight ~ s(Time) )
## weight ~ s(Time)
# Fit the GAM Model
model.gam <- mgcv::gam(fmla.gam, data=soybean_train, family="gaussian")

# Call summary() on model.lin and look for R-squared
model.lin <- lm(weight ~ Time, data=soybean_train)
summary(model.lin)
## 
## Call:
## lm(formula = weight ~ Time, data = soybean_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3933 -1.7100 -0.3909  1.9056 11.4381 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.559283   0.358527  -18.30   <2e-16 ***
## Time         0.292094   0.007444   39.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.778 on 328 degrees of freedom
## Multiple R-squared:  0.8244, Adjusted R-squared:  0.8238 
## F-statistic:  1540 on 1 and 328 DF,  p-value: < 2.2e-16
# Call summary() on model.gam and look for R-squared
summary(model.gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## weight ~ s(Time)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.1645     0.1143   53.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(Time) 8.495   8.93 338.2  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.902   Deviance explained = 90.4%
## GCV = 4.4395  Scale est. = 4.3117    n = 330
# Call plot() on model.gam
plot(model.gam)

# The data frame soybean.test and the models model.lin and model.gam are in the workspace

# soybean_test is in the workspace
summary(soybean_test)
##       Plot    Variety   Year         Time           weight       
##  1988F8 : 4   F:43    1988:32   Min.   :14.00   Min.   : 0.0380  
##  1988P7 : 4   P:39    1989:26   1st Qu.:23.00   1st Qu.: 0.4248  
##  1989F8 : 4           1990:24   Median :41.00   Median : 3.0025  
##  1990F8 : 4                     Mean   :44.09   Mean   : 7.1576  
##  1988F4 : 3                     3rd Qu.:69.00   3rd Qu.:15.0113  
##  1988F2 : 3                     Max.   :84.00   Max.   :30.2717  
##  (Other):60
# Get predictions from linear model
soybean_test$pred.lin <- predict(model.lin, newdata = soybean_test)

# Get predictions from gam model
soybean_test$pred.gam <- as.numeric(predict(model.gam, newdata = soybean_test))

# Gather the predictions into a "long" dataset
soybean_long <- soybean_test %>%
  tidyr::gather(key = modeltype, value = pred, pred.lin, pred.gam)

# Calculate the rmse
soybean_long %>%
  mutate(residual = weight - pred) %>%      # residuals
  group_by(modeltype) %>%                   # group by modeltype
  summarize(rmse = sqrt(mean(residual**2))) # calculate the RMSE
## # A tibble: 2 x 2
##   modeltype  rmse
##   <chr>     <dbl>
## 1 pred.gam   2.29
## 2 pred.lin   3.19
# Compare the predictions against actual weights on the test data
soybean_long %>%
  ggplot(aes(x = Time)) +                          # the column for the x axis
  geom_point(aes(y = weight)) +                    # the y-column for the scatterplot
  geom_point(aes(y = pred, color = modeltype)) +   # the y-column for the point-and-line plot
  geom_line(aes(y = pred, color = modeltype, linetype = modeltype)) + # the y-column for the point-and-line plot
  scale_color_brewer(palette = "Dark2")


Chapter 5 - Tree Based Models

The intuition behind tree-based models:

  • Decision trees learn rules to help find non-linear or non-additive relationships
    • Trees have an expressive concept space
    • However, trees give coarse-grained predictions (not that may predictive buckets)
  • Ensemble models made up of many trees (random forests, gradient boosted) will typically perform better than just a simple tree

Random forests:

  • Multiple trees from the training data, and then averaging the results together
    • Finer grained performance, with lower risk of over-fits
  • The overall process for growing a random forest is to
    • Draw bootstrapped samples from the training data
    • For each sample, grow a tree, splitting on the best variable (from a random subset chosen for that tree), and continuing until the tree is grown
    • To score a datum, evaluate it with all the trees and average the results
  • Bike rental data example - fit on data from January and predict to data in February
    • Can run using ranger::ranger(formula, data, num.trees=500, respect.unordered.factors= “order”)
    • respect.unordered.factors= “order” will safely encode categorical variables and tends to run faster
    • The number of trees should be 200+, with the default being 500
  • Printing a random forest model from ranger will show a projected OOB (out of bag) error
  • Can get the predictions using predict(myRanger, myNewData)$predictions

One-Hot-Encoding caregorical variables:

  • Most R functions manage the conversion of categorical variables to dummy variables easily using model.matrix()
    • However, xgboost() does NOT do this, and requires that categorical variables be converted to numerical variables prior to running the code
    • The conversion of a categorical variable to dummy variables is known as one-hot-encoding
  • Can run the one-hot-encoding process using vtreat
    • designTreatmentsZ(dframe, varlist) to design a treatment plan for the training data
    • prepare() to create clean data - all numerical, no missing values, function can be used with all future data
    • Important to run the same prepare() with the same designTreatmentsZ() on both training and test data

Gradient boosting machines:

  • Incremental improvement of existing models
    • Fit a shallow tree to a set of data
    • Fit a second tree to the residuals
    • Calculate a weighted average of the first tree and the second tree
    • Continue until a stopping condition is hit - number of iterations, error, test-set error
    • Learning rate (eta) can be between 0 and 1, with large rates associated with both faster learning and greater risk of overfits
  • Best practices with xgboost include
    • Run xgb.cv() with a large number of rounds (trees)
    • Investigate xgb.cv()$evaluation_log (records estimated RMSE for each round)
    • Run xgboost(), setting nrounds equal to whatever parameter was best in the xgb.cv()$evaluation_log records
  • Models frequently do much better on RMSE than even random forest models

Example code includes:

# Since there are a lot of input variables, for convenience we will specify the outcome and the inputs in the variables outcome and vars, and use paste() to assemble a string representing the model formula.
# The data frame bikesJuly is in the workspace. The sample code specifies the names of the outcome and input variables.

# bikesJuly is in the workspace
str(bikesJuly)
## 'data.frame':    744 obs. of  12 variables:
##  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
##  $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
##  $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
##  $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
##  $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
##  $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
##  $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
# Random seed to reproduce results
seed <- 1804240829

# The outcome column
(outcome <- "cnt")
## [1] "cnt"
# The input variables
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
## [1] "hr"         "holiday"    "workingday" "weathersit" "temp"      
## [6] "atemp"      "hum"        "windspeed"
# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
## [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
# Load the package ranger
library(ranger)

# Fit and print the random forest model
(bike_model_rf <- ranger(fmla, # formula 
                         bikesJuly, # data
                         num.trees = 500, 
                         respect.unordered.factors = "order", 
                         seed = seed))
## Ranger result
## 
## Call:
##  ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order",      seed = seed) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      744 
## Number of independent variables:  8 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         none 
## OOB prediction error (MSE):       8298.542 
## R squared (OOB):                  0.8190613
# In this exercise you will use the model that you fit in the previous exercise to predict bike rentals for the month of August.

# The predict() function for a ranger model produces a list
# One of the elements of this list is predictions, a vector of predicted values
# You can access predictions with the $ notation for accessing named elements of a list: predict(model, data)$predictions

# The model bike_model_rf and the dataset bikesAugust (for evaluation) are loaded into your workspace.

# bikesAugust is in the workspace
str(bikesAugust)
## 'data.frame':    744 obs. of  13 variables:
##  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
##  $ atemp     : num  0.636 0.606 0.576 0.576 0.591 ...
##  $ hum       : num  0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
##  $ windspeed : num  0.1642 0.0896 0.1045 0.1045 0.1343 ...
##  $ cnt       : int  47 33 13 7 4 49 185 487 681 350 ...
##  $ instant   : int  13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
##  $ mnth      : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pred      : num  94.96 51.74 37.98 17.58 9.36 ...
# bike_model_rf is in the workspace
bike_model_rf
## Ranger result
## 
## Call:
##  ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order",      seed = seed) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      744 
## Number of independent variables:  8 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         none 
## OOB prediction error (MSE):       8298.542 
## R squared (OOB):                  0.8190613
# Make predictions on the August data
bikesAugust$pred <- predict(bike_model_rf, bikesAugust)$predictions

# Calculate the RMSE of the predictions
bikesAugust %>% 
  mutate(residual = cnt - pred)  %>% # calculate the residual
  summarize(rmse  = sqrt(mean(residual**2)))      # calculate rmse
##       rmse
## 1 96.73917
# Plot actual outcome vs predictions (predictions on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) + 
  geom_point() + 
  geom_abline()

# The data frame bikesAugust (with predictions) is in the workspace
# The plot quasipoisson_plot of quasipoisson model predictions as a function of time is also in the workspace

# Print quasipoisson_plot
plot(quasipoisson_plot)

# Plot predictions and cnt by date/time
randomforest_plot <- bikesAugust %>% 
  mutate(instant = (instant - min(instant))/24) %>%  # set start to 0, convert unit to days
  tidyr::gather(key = valuetype, value = value, cnt, pred) %>%
  filter(instant < 14) %>% # first two weeks
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Random Forest plot")

randomforest_plot

# In this exercise you will use vtreat to one-hot-encode a categorical variable on a small example
# vtreat creates a treatment plan to transform categorical variables into indicator variables (coded "lev"), and to clean bad values out of numerical variables (coded "clean").

# To design a treatment plan use the function designTreatmentsZ()
# treatplan <- designTreatmentsZ(data, varlist)
# data: the original training data frame
# varlist: a vector of input variables to be treated (as strings)

# designTreatmentsZ() returns a list with an element scoreFrame: a data frame that includes the names and types of the new variables:
# scoreFrame <- treatplan %>% magrittr::use_series(scoreFrame) %>% select(varName, origName, code)
# varName: the name of the new treated variable
# origName: the name of the original variable that the treated variable comes from
# code: the type of the new variable.
# "clean": a numerical variable with no NAs or NaNs
# "lev": an indicator variable for a specific level of the original categorical variable
# (magrittr::use_series() is an alias for $ that you can use in pipes.)

# For these exercises, we want varName where code is either "clean" or "lev"
# newvarlist <- scoreFrame %>% filter(code %in% c("clean", "lev") %>% magrittr::use_series(varName)
# To transform the data set into all numerical and one-hot-encoded variables, use prepare(): data.treat <- prepare(treatplan, data, varRestrictions = newvarlist)
# treatplan: the treatment plan
# data: the data frame to be treated
# varRestrictions: the variables desired in the treated data

# The data frame dframe and the package magrittr are loaded in the workspace.

# dframe is in the workspace
dframe <- data.frame(color=c('b', 'r', 'r', 'r', 'r', 'b', 'r', 'g', 'b', 'b'), 
                     size=c(13, 11, 15, 14, 13, 11, 9, 12, 7, 12), 
                     popularity=c(1.079, 1.396, 0.922, 1.203, 1.084, 0.804, 1.104, 0.875, 0.695, 0.883), 
                     stringAsFactors=TRUE
                     )
dframe
##    color size popularity stringAsFactors
## 1      b   13      1.079            TRUE
## 2      r   11      1.396            TRUE
## 3      r   15      0.922            TRUE
## 4      r   14      1.203            TRUE
## 5      r   13      1.084            TRUE
## 6      b   11      0.804            TRUE
## 7      r    9      1.104            TRUE
## 8      g   12      0.875            TRUE
## 9      b    7      0.695            TRUE
## 10     b   12      0.883            TRUE
# Create and print a vector of variable names
(vars <- c("color", "size"))
## [1] "color" "size"
# Load the package vtreat
# library(vtreat)

# Create the treatment plan
treatplan <- vtreat::designTreatmentsZ(dframe, vars)
## [1] "designing treatments Tue May 22 12:17:50 2018"
## [1] "designing treatments Tue May 22 12:17:50 2018"
## [1] " have level statistics Tue May 22 12:17:50 2018"
## [1] "design var color Tue May 22 12:17:50 2018"
## [1] "design var size Tue May 22 12:17:50 2018"
## [1] " scoring treatments Tue May 22 12:17:50 2018"
## [1] "have treatment plan Tue May 22 12:17:50 2018"
# Examine the scoreFrame
(scoreFrame <- treatplan %>%
    use_series(scoreFrame) %>%
    select(varName, origName, code))
##         varName origName  code
## 1 color_lev_x.b    color   lev
## 2 color_lev_x.g    color   lev
## 3 color_lev_x.r    color   lev
## 4    color_catP    color  catP
## 5    size_clean     size clean
# We only want the rows with codes "clean" or "lev"
(newvars <- scoreFrame %>%
    filter(code %in% c("clean", "lev")) %>%
    use_series(varName))
## [1] "color_lev_x.b" "color_lev_x.g" "color_lev_x.r" "size_clean"
# Create the treated training data
(dframe.treat <- vtreat::prepare(treatplan, dframe, varRestriction = newvars))
##    color_lev_x.b color_lev_x.g color_lev_x.r size_clean
## 1              1             0             0         13
## 2              0             0             1         11
## 3              0             0             1         15
## 4              0             0             1         14
## 5              0             0             1         13
## 6              1             0             0         11
## 7              0             0             1          9
## 8              0             1             0         12
## 9              1             0             0          7
## 10             1             0             0         12
# When a level of a categorical variable is rare, sometimes it will fail to show up in training data
# If that rare level then appears in future data, downstream models may not know what to do with it
# When such novel levels appear, using model.matrix or caret::dummyVars to one-hot-encode will not work correctly.

# vtreat is a "safer" alternative to model.matrix for one-hot-encoding, because it can manage novel levels safely
# vtreat also manages missing values in the data (both categorical and continuous).

# In this exercise you will see how vtreat handles categorical values that did not appear in the training set
# The treatment plan treatplan and the set of variables newvars from the previous exercise are still in your workspace
# dframe and a new data frame testframe are also in your workspace

# treatplan is in the workspace
summary(treatplan)
##               Length Class           Mode     
## treatments    3      -none-          list     
## scoreFrame    8      data.frame      list     
## outcomename   1      -none-          character
## vtreatVersion 1      package_version list     
## outcomeType   1      -none-          character
## outcomeTarget 1      -none-          character
## meanY         1      -none-          logical  
## splitmethod   1      -none-          character
# newvars is in the workspace
(newvars <- c('color_lev_x.b', 'color_lev_x.g', 'color_lev_x.r', 'size_clean'))
## [1] "color_lev_x.b" "color_lev_x.g" "color_lev_x.r" "size_clean"
# Print dframe and testframe
testframe <- data.frame(color=c('g', 'g', 'y', 'g', 'g', 'y', 'b', 'g', 'g', 'r'), 
                        size=c(7, 8, 10, 12, 6, 8, 12, 12, 12, 8), 
                        popularity=c(0.973, 0.912, 1.422, 1.191, 0.987, 1.37, 1.096, 0.916, 1, 1.314), 
                        stringAsFactors=TRUE
                        )
testframe
##    color size popularity stringAsFactors
## 1      g    7      0.973            TRUE
## 2      g    8      0.912            TRUE
## 3      y   10      1.422            TRUE
## 4      g   12      1.191            TRUE
## 5      g    6      0.987            TRUE
## 6      y    8      1.370            TRUE
## 7      b   12      1.096            TRUE
## 8      g   12      0.916            TRUE
## 9      g   12      1.000            TRUE
## 10     r    8      1.314            TRUE
# Use prepare() to one-hot-encode testframe
(testframe.treat <- vtreat::prepare(treatplan, testframe, varRestriction = newvars))
##    color_lev_x.b color_lev_x.g color_lev_x.r size_clean
## 1              0             1             0          7
## 2              0             1             0          8
## 3              0             0             0         10
## 4              0             1             0         12
## 5              0             1             0          6
## 6              0             0             0          8
## 7              1             0             0         12
## 8              0             1             0         12
## 9              0             1             0         12
## 10             0             0             1          8
# The outcome column
(outcome <- "cnt")
## [1] "cnt"
# The input columns
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
## [1] "hr"         "holiday"    "workingday" "weathersit" "temp"      
## [6] "atemp"      "hum"        "windspeed"
# Create the treatment plan from bikesJuly (the training data)
treatplan <- vtreat::designTreatmentsZ(bikesJuly, vars, verbose = FALSE)

# Get the "clean" and "lev" variables from the scoreFrame
(newvars <- treatplan %>%
  use_series(scoreFrame) %>%        
  filter(code %in% c("clean", "lev")) %>%  # get the rows you care about
  use_series(varName))           # get the varName column
##  [1] "hr_lev_x.0"                             
##  [2] "hr_lev_x.1"                             
##  [3] "hr_lev_x.10"                            
##  [4] "hr_lev_x.11"                            
##  [5] "hr_lev_x.12"                            
##  [6] "hr_lev_x.13"                            
##  [7] "hr_lev_x.14"                            
##  [8] "hr_lev_x.15"                            
##  [9] "hr_lev_x.16"                            
## [10] "hr_lev_x.17"                            
## [11] "hr_lev_x.18"                            
## [12] "hr_lev_x.19"                            
## [13] "hr_lev_x.2"                             
## [14] "hr_lev_x.20"                            
## [15] "hr_lev_x.21"                            
## [16] "hr_lev_x.22"                            
## [17] "hr_lev_x.23"                            
## [18] "hr_lev_x.3"                             
## [19] "hr_lev_x.4"                             
## [20] "hr_lev_x.5"                             
## [21] "hr_lev_x.6"                             
## [22] "hr_lev_x.7"                             
## [23] "hr_lev_x.8"                             
## [24] "hr_lev_x.9"                             
## [25] "holiday_clean"                          
## [26] "workingday_clean"                       
## [27] "weathersit_lev_x.Clear.to.partly.cloudy"
## [28] "weathersit_lev_x.Light.Precipitation"   
## [29] "weathersit_lev_x.Misty"                 
## [30] "temp_clean"                             
## [31] "atemp_clean"                            
## [32] "hum_clean"                              
## [33] "windspeed_clean"
# Prepare the training data
bikesJuly.treat <- vtreat::prepare(treatplan, bikesJuly, varRestriction = newvars)

# Prepare the test data
bikesAugust.treat <- vtreat::prepare(treatplan, bikesAugust, varRestriction = newvars)

# Call str() on the treated data
str(bikesJuly.treat)
## 'data.frame':    744 obs. of  33 variables:
##  $ hr_lev_x.0                             : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.1                             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.10                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.11                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.12                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.13                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.14                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.15                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.16                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.17                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.18                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.19                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.2                             : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.20                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.21                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.22                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.23                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.3                             : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ hr_lev_x.4                             : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ hr_lev_x.5                             : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ hr_lev_x.6                             : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ hr_lev_x.7                             : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ hr_lev_x.8                             : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ hr_lev_x.9                             : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ holiday_clean                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ workingday_clean                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit_lev_x.Clear.to.partly.cloudy: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ weathersit_lev_x.Light.Precipitation   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit_lev_x.Misty                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ temp_clean                             : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
##  $ atemp_clean                            : num  0.727 0.697 0.697 0.712 0.667 ...
##  $ hum_clean                              : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
##  $ windspeed_clean                        : num  0 0.1343 0.0896 0.1343 0.194 ...
str(bikesAugust.treat)
## 'data.frame':    744 obs. of  33 variables:
##  $ hr_lev_x.0                             : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.1                             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.10                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.11                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.12                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.13                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.14                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.15                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.16                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.17                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.18                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.19                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.2                             : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.20                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.21                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.22                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.23                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_lev_x.3                             : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ hr_lev_x.4                             : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ hr_lev_x.5                             : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ hr_lev_x.6                             : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ hr_lev_x.7                             : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ hr_lev_x.8                             : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ hr_lev_x.9                             : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ holiday_clean                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ workingday_clean                       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ weathersit_lev_x.Clear.to.partly.cloudy: num  1 1 1 1 0 0 1 0 0 0 ...
##  $ weathersit_lev_x.Light.Precipitation   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit_lev_x.Misty                 : num  0 0 0 0 1 1 0 1 1 1 ...
##  $ temp_clean                             : num  0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
##  $ atemp_clean                            : num  0.636 0.606 0.576 0.576 0.591 ...
##  $ hum_clean                              : num  0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
##  $ windspeed_clean                        : num  0.1642 0.0896 0.1045 0.1045 0.1343 ...
# In this exercise you will get ready to build a gradient boosting model to predict the number of bikes rented in an hour as a function of the weather and the type and time of day
# You will train the model on data from the month of July.

# The July data is loaded into your workspace
# Remember that bikesJuly.treat no longer has the outcome column, so you must get it from the untreated data: bikesJuly$cnt

# You will use the xgboost package to fit the random forest model
# The function xgb.cv() uses cross-validation to estimate the out-of-sample learning error as each new tree is added to the model
# The appropriate number of trees to use in the final model is the number that minimizes the holdout RMSE

# For this exercise, the key arguments to the xgb.cv() call are:
# data: a numeric matrix.
# label: vector of outcomes (also numeric).
# nrounds: the maximum number of rounds (trees to build).
# nfold: the number of folds for the cross-validation. 5 is a good number.
# objective: "reg:linear" for continuous outcomes.
# eta: the learning rate.
# max_depth: depth of trees.
# early_stopping_rounds: after this many rounds without improvement, stop.
# verbose: 0 to stay silent

# The data frames bikesJuly and bikesJuly.treat are in the workspace


# Load the package xgboost
library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
# Run xgb.cv
cv <- xgb.cv(data = as.matrix(bikesJuly.treat), 
            label = bikesJuly$cnt,
            nrounds = 100,
            nfold = 5,
            objective = "reg:linear",
            eta = 0.3,
            max_depth = 6,
            early_stopping_rounds = 10,
            verbose = 0    # silent
)

# Get the evaluation log 
elog <- cv$evaluation_log

# Determine and print how many trees minimize training and test error
elog %>% 
   summarize(ntrees.train = which.min(train_rmse_mean),   # find the index of min(train_rmse_mean)
             ntrees.test  = which.min(test_rmse_mean))   # find the index of min(test_rmse_mean)
##   ntrees.train ntrees.test
## 1           77          67
# In this exercise you will fit a gradient boosting model using xgboost() to predict the number of bikes rented in an hour as a function of the weather and the type and time of day
# You will train the model on data from the month of July and predict on data for the month of August

# The datasets for July and August are loaded into your workspace
# Remember the vtreat-ed data no longer has the outcome column, so you must get it from the original data (the cnt column)

# For convenience, the number of trees to use, ntrees from the previous exercise is in the workspace

# The data frames bikesJuly, bikesJuly.treat, bikesAugust and bikesAugust.treat are in the workspace. The number of trees ntrees (84) is in the workspace

# The number of trees to use, as determined by xgb.cv
(ntrees <- 84)
## [1] 84
# Run xgboost
bike_model_xgb <- xgboost(data = as.matrix(bikesJuly.treat), # training data as matrix
                   label = bikesJuly$cnt,  # column of outcomes
                   nrounds = ntrees,       # number of trees to build
                   objective = "reg:linear", # objective
                   eta = 0.3,
                   depth = 6,
                   verbose = 0  # silent
)

# Make predictions
bikesAugust$pred <- predict(bike_model_xgb, as.matrix(bikesAugust.treat))

# Plot predictions (on x axis) vs actual bike rental count
ggplot(bikesAugust, aes(x = pred, y = cnt)) + 
  geom_point() + 
  geom_abline()

# bikesAugust is in the workspace
str(bikesAugust)
## 'data.frame':    744 obs. of  13 variables:
##  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
##  $ atemp     : num  0.636 0.606 0.576 0.576 0.591 ...
##  $ hum       : num  0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
##  $ windspeed : num  0.1642 0.0896 0.1045 0.1045 0.1343 ...
##  $ cnt       : int  47 33 13 7 4 49 185 487 681 350 ...
##  $ instant   : int  13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
##  $ mnth      : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pred      : num  48.548 35.349 0.625 -6.652 3.563 ...
# Calculate RMSE
bikesAugust %>%
  mutate(residuals = cnt - pred) %>%
  summarize(rmse = sqrt(mean(residuals**2)))
##       rmse
## 1 76.36407
# Print quasipoisson_plot
quasipoisson_plot

# Print randomforest_plot
randomforest_plot

# Plot predictions and actual bike rentals as a function of time (days)
bikesAugust %>% 
  mutate(instant = (instant - min(instant))/24) %>%  # set start to 0, convert unit to days
  tidyr::gather(key = valuetype, value = value, cnt, pred) %>%
  filter(instant < 14) %>% # first two weeks
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Gradient Boosting model")


Machine Learning with Tree-Based Models in R

Chapter 1 - Classification Trees

Overview - supervised learning process using classification trees:

  • Supervised learning typically involves a train/test process
  • Tree-based models combine interpretability, ease-of-use, and accuracy
    • They are useful for making decisions and numeric predictions
  • Course will include many topics related to trees
    • Interpret and explain trees
    • Explore use cases
    • Build and evaluate trees for both classification and regression
    • Tune model parameters for optimal performance
  • Several types of tree-based models
    • Classification and regression
    • Bagged trees
    • Random forests
    • Boosted trees (GBM)
  • Terminal nodes are also known as “leaf nodes”, and nodes at the type are known as “root nodes”
  • The rpart library is commonly used for classification and regression tree-based models
    • rpart::rpart(response ~ ., data=)

Introduction to classification trees:

  • Advantages of trees - simple and easy to explain/visualize, can handle categorical variables natively, handle missing data fairly elegantly (usually), robust to outliers (requires less pre-processing)
    • Additionally, can model non-linearity and is fast to implement even on large test datasets
  • Disadvantages of trees - high variance (tree is very influenced by the test dataset; risk of over-fitting), can be hard to interpret if deep

Overview of the modeling process:

  • Typically, use an 80/20 split for Train-Test for building a tree-based model
    • Cross-validation can extend this approach further
    • Begin by getting the number of rows in the data, and a target train size, and then randomize a sample of 1:nrow() to length target_train
  • Training a tree in R requires a formula, data, and methods
    • rpart(y ~ myX, data=, method=)
    • method = “class” is for making categorical predictions

Evaluating classification model performance:

  • The tree predictions can be made using predict(object, newdata, type=)
    • If type=“class”, a prediction will be made for the class, rather than probabilities by class
    • If type=“prob”, a matrix of predicted probabilities by class will be returned
  • There are many potential metrics for binary classification of a tree-based model
    • Accuracy, Confusion Matrix, Log-loss, AUC, etc.
    • Accuracy measures how often the classifier predicts the class correctly - nCorrect / nRows
    • Confusion matrix gives the accuracy (and other statistics) by actual/predicted values - columns are true labels, rows are predicted labels
  • Can use caret::confusionMatrix(data=, reference=) # data is the prediction, reference is the true value

Splitting criteria in trees:

  • The classification process starts from the root node and works downward, fulfilling the specified split condition
  • The tree is in essence splitting the space in to decision boundaries - objective is to have the sub-regions that are as “pure” as possible
    • A measure for purity is helpful, and the maths are easier for measuring the impurities
    • The Gini index (lower means more pure) is typically used to determine the impurity of a node

Example code includes:

# Let's get started and build our first classification tree
# A classification tree is a decision tree that performs a classification (vs regression) task

# You will train a decision tree model to understand which loan applications are at higher risk of default using a subset of the German Credit Dataset
# The response variable, called "default", indicates whether the loan went into a default or not, which means this is a binary classification problem (there are just two classes)

# You will use the rpart package to fit the decision tree and the rpart.plot package to visualize the tree

# The data frame creditsub is in the workspace
# This data frame is a subset of the original German Credit Dataset, which we will use to train our first classification tree model

credit <- read.csv("./RInputFiles/credit.csv")
creditsub <- credit %>%
    select(months_loan_duration, percent_of_income, years_at_residence, age, default)

# Look at the data
str(creditsub, give.attr=FALSE)
## 'data.frame':    1000 obs. of  5 variables:
##  $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
##  $ percent_of_income   : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ years_at_residence  : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ default             : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...
# Create the model
credit_model <- rpart::rpart(formula = default ~ ., 
                      data = creditsub, 
                      method = "class")

# Display the results
rpart.plot::rpart.plot(x = credit_model, yesno = 2, type = 0, extra = 0)

# For this exercise, you'll randomly split the German Credit Dataset into two pieces: a training set (80%) called credit_train and a test set (20%) that we will call credit_test
# We'll use these two sets throughout the chapter.

# The credit data frame is loaded into the workspace.

# Total number of rows in the credit data frame
n <- nrow(credit)

# Number of rows for the training set (80% of the dataset)
n_train <- round(0.8 * n) 

# Create a vector of indices which is an 80% random sample
set.seed(123)
train_indices <- sample(1:n, n_train)

# Subset the credit data frame to training indices only
credit_train <- credit[train_indices, ]  
  
# Exclude the training indices to create the test set
credit_test <- credit[-train_indices, ]  


# Train the model (to predict 'default')
credit_model <- rpart::rpart(formula = default ~ ., 
                      data = credit_train, 
                      method = "class")

# Look at the model output                      
print(credit_model)
## n= 800 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 800 238 no (0.70250000 0.29750000)  
##     2) checking_balance=> 200 DM,unknown 369  45 no (0.87804878 0.12195122) *
##     3) checking_balance=< 0 DM,1 - 200 DM 431 193 no (0.55220418 0.44779582)  
##       6) months_loan_duration< 20.5 231  84 no (0.63636364 0.36363636)  
##        12) credit_history=critical,good,poor 207  66 no (0.68115942 0.31884058)  
##          24) amount< 7341 200  60 no (0.70000000 0.30000000) *
##          25) amount>=7341 7   1 yes (0.14285714 0.85714286) *
##        13) credit_history=perfect,very good 24   6 yes (0.25000000 0.75000000) *
##       7) months_loan_duration>=20.5 200  91 yes (0.45500000 0.54500000)  
##        14) savings_balance=> 1000 DM,unknown 35   9 no (0.74285714 0.25714286)  
##          28) amount>=2079 26   2 no (0.92307692 0.07692308) *
##          29) amount< 2079 9   2 yes (0.22222222 0.77777778) *
##        15) savings_balance=< 100 DM,100 - 500 DM,500 - 1000 DM 165  65 yes (0.39393939 0.60606061)  
##          30) months_loan_duration< 47.5 132  60 yes (0.45454545 0.54545455)  
##            60) age>=29.5 77  35 no (0.54545455 0.45454545)  
##             120) amount>=2249 62  24 no (0.61290323 0.38709677)  
##               240) credit_history=critical,poor,very good 25   5 no (0.80000000 0.20000000) *
##               241) credit_history=good,perfect 37  18 yes (0.48648649 0.51351351)  
##                 482) age< 41 21   7 no (0.66666667 0.33333333) *
##                 483) age>=41 16   4 yes (0.25000000 0.75000000) *
##             121) amount< 2249 15   4 yes (0.26666667 0.73333333) *
##            61) age< 29.5 55  18 yes (0.32727273 0.67272727)  
##             122) months_loan_duration< 31.5 38  16 yes (0.42105263 0.57894737)  
##               244) amount>=3415 17   6 no (0.64705882 0.35294118) *
##               245) amount< 3415 21   5 yes (0.23809524 0.76190476) *
##             123) months_loan_duration>=31.5 17   2 yes (0.11764706 0.88235294) *
##          31) months_loan_duration>=47.5 33   5 yes (0.15151515 0.84848485) *
# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model,  
                            newdata = credit_test,   
                            type = "class")  
                            
# Calculate the confusion matrix for the test set
caret::confusionMatrix(data = class_prediction,       
                reference = credit_test$default)  
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  125  46
##        yes  13  16
##                                           
##                Accuracy : 0.705           
##                  95% CI : (0.6366, 0.7672)
##     No Information Rate : 0.69            
##     P-Value [Acc > NIR] : 0.3543          
##                                           
##                   Kappa : 0.192           
##  Mcnemar's Test P-Value : 3.099e-05       
##                                           
##             Sensitivity : 0.9058          
##             Specificity : 0.2581          
##          Pos Pred Value : 0.7310          
##          Neg Pred Value : 0.5517          
##              Prevalence : 0.6900          
##          Detection Rate : 0.6250          
##    Detection Prevalence : 0.8550          
##       Balanced Accuracy : 0.5819          
##                                           
##        'Positive' Class : no              
## 
# Train two models that use a different splitting criterion and use the validation set to choose a "best" model from this group
# To do this you'll use the parms argument of the rpart() function
# This argument takes a named list that contains values of different parameters you can use to change how the model is trained
# Set the parameter split to control the splitting criterion

# The datasets credit_test and credit_train have already been loaded for you

# Train a gini-based model
credit_model1 <- rpart::rpart(formula = default ~ ., 
                       data = credit_train, 
                       method = "class",
                       parms = list(split = "gini"))

# Train an information-based model
credit_model2 <- rpart::rpart(formula = default ~ ., 
                       data = credit_train, 
                       method = "class",
                       parms = list(split = "information"))

# Generate predictions on the validation set using the gini model
pred1 <- predict(object = credit_model1, 
                 newdata = credit_test,
                 type = "class")

# Generate predictions on the validation set using the information model
pred2 <- predict(object = credit_model2, 
                 newdata = credit_test,
                 type = "class")

dt_preds <- predict(object = credit_model1, newdata = credit_test)[, "yes"]
    
# Compare classification error
Metrics::ce(actual = credit_test$default, 
   predicted = pred1)
## [1] 0.295
Metrics::ce(actual = credit_test$default, 
   predicted = pred2)  
## [1] 0.275

Chapter 2 - Regression Trees

Introduction to regression trees:

  • The goal is regression is to predict a value based on inputs - for example, predicting weight
    • Can be a variable that is underlying continuous or underlying discrete (such as Poisson or count variables)
  • Numeric decision trees are measured based on RMSE and other measures or variance or deviation from mean
  • The regression tree can be run as rpart(formula=, data=, method=, control=)
    • The method=“anova” will run a regression tree, in contrast to method=“class”
  • There is a somewhat modified process for the test-train process in this case - becomes test, train, validation
    • The validation set is used to help tune parameters, while the test set is used to give a final out-of-model error estimate (test set should only be used once)

Performance metrics for regression:

  • Two popular metrics include Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE)
    • RMSE is the sample standard deviation of the errors
    • Both metrics are indifferent to the direction (sign) of the errors, and both are in the units of the dependent variable
    • RMSE particularly punishes big misses, and is especially preferred when big misses are large problems
  • The package Metrics has some useful functions for assessing model performance
    • All of the functions in Metrics take a vector of actuals and then a vector of predicted
    • Metrics::rmse(actual=, predicted=) will calculate the RMSE on a specified data set

Hyper-parameters for a decision tree:

  • The default parameters are chosen to do a reasonable job on the average dataset, but they can often be improved
    • The main way to change hyper-parameters is to specify the control= argument, choosing from the list as available in ?rpart.control
    • minsplit: minimum number of data points to attempt a split (default 20)
    • cp: complexity parameter (default 0.1) - this is a penalty term to control tree size, so smaller values will drive more complex trees
    • maxdepth: maximum number of intermediate nodes (default 30 is already fairly liberal)
  • Can identify the best cp parameter and prune the tree accordingly
    • print(model$cptable) will show various cp options and the associated errors
    • prune(tree=model, cp=) will prune the tree based on the parameter entered for cp (often the lowest-error cp from the process descrbed above)

Grid search for model selection:

  • Training a sequence of models, and identifying the best set, is a common process
    • Goal is to test a grid of hyper-parameters, and to choose the parameters that come back with the “winning” result
  • Grid search is also called hyper-parameter search, and is a search through a list of potential hyper-parameter values
    • The “grid” is the set of hyper-parameter models that you iterate over during the process of model-building
    • Models are evaluated against a specified metrics, with CV and/or hold-out being used to assess the value for the metric
    • Some guesswork is needed to specify the grid; sometimes helpful to test a small grid, and expand around where there seems to be a minima
  • While the grid search is available through the caret package, this course will build the equivalent using a for-loop structure

Example code includes:

# These examples will use a subset of the Student Performance Dataset from UCI ML Dataset Repository

# The goal of this exercise is to predict a student's final Mathematics grade based on the following variables:
# sex, age, address, studytime (weekly study time), schoolsup (extra educational support), famsup (family educational support), paid (extra paid classes within the course subject) and absences

# The response is final_grade (numeric: from 0 to 20, output target).

# After the initial exploration, let's split the data into training, validation, test sets
# In this chapter, we will introduce the idea of a validation set, which can be used to select a "best" model from a set of competing models

# In Chapter 1, we demonstrated a simple way to split the data into two pieces using the sample() function
# In this exercise, we will take a slightly different approach to splitting the data that allows us to split the data into more than two parts (here we want three parts: train, validation, test)
# We still use the sample() function, but instead of sampling the indices themselves, we use the sample() function to assign each row to either the training, validation or test sets according to a probability distribution

# The dataset grade is already in your workspace.
grade <- read.csv("./RInputFiles/grade.csv")

# Look/explore the data
str(grade)
## 'data.frame':    395 obs. of  8 variables:
##  $ final_grade: num  3 3 5 7.5 5 7.5 5.5 3 9.5 7.5 ...
##  $ age        : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address    : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ studytime  : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ schoolsup  : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup     : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid       : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ absences   : int  6 4 10 2 4 10 0 6 0 0 ...
# Randomly assign rows to ids (1/2/3 represents train/valid/test)
# This will generate a vector of ids of length equal to the number of rows
# The train/valid/test split will be approximately 70% / 15% / 15% 
set.seed(1)
assignment <- sample(1:3, size = nrow(grade), prob = c(0.7, 0.15, 0.15), replace = TRUE)

# Create a train, validation and tests from the original data frame 
grade_train <- grade[assignment == 1, ]    # subset the grade data frame to training indices only
grade_valid <- grade[assignment == 2, ]  # subset the grade data frame to validation indices only
grade_test <- grade[assignment == 3, ]   # subset the grade data frame to test indices only


# In this exercise, we will use the grade_train dataset to fit a regression tree using rpart() and visualize it using rpart.plot()
# A regression tree plot will look identical to a classification tree plot, with the exception that there will be numeric values in the leaf nodes instead of predicted classes.

# This is very similar to what we did previously in Chapter 1
# When fitting a classification tree, we should use method = "class", however, when fitting a regression tree, we need to set method = "anova"
# By default, the rpart() function will make an intelligent guess as to what the method value should be based on the data type of your response column,
# but it's recommened that you explictly set the method for reproducibility reasons (since the auto-guesser may change in the future)

# The grade_train training set is loaded into the workspace

# Train the model
grade_model <- rpart::rpart(formula = final_grade ~ ., 
                     data = grade_train, 
                     method = "anova")

# Look at the model output                      
print(grade_model)
## n= 282 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 282 1519.49700 5.271277  
##    2) absences< 0.5 82  884.18600 4.323171  
##      4) paid=no 50  565.50500 3.430000  
##        8) famsup=yes 22  226.36360 2.272727 *
##        9) famsup=no 28  286.52680 4.339286 *
##      5) paid=yes 32  216.46880 5.718750  
##       10) age>=17.5 10   82.90000 4.100000 *
##       11) age< 17.5 22   95.45455 6.454545 *
##    3) absences>=0.5 200  531.38000 5.660000  
##      6) absences>=13.5 42  111.61900 4.904762 *
##      7) absences< 13.5 158  389.43670 5.860759  
##       14) schoolsup=yes 23   50.21739 4.847826 *
##       15) schoolsup=no 135  311.60000 6.033333  
##         30) studytime< 3.5 127  276.30710 5.940945 *
##         31) studytime>=3.5 8   17.00000 7.500000 *
# Plot the tree model
rpart.plot::rpart.plot(x = grade_model, yesno = 2, type = 0, extra = 0)

# Generate predictions on a test set
pred <- predict(object = grade_model,   # model object 
                newdata = grade_test)  # test dataset

# Compute the RMSE
Metrics::rmse(actual = grade_test$final_grade, 
     predicted = pred)
## [1] 2.278249
# Plot the "CP Table"
rpart::plotcp(grade_model)

# Print the "CP Table"
print(grade_model$cptable)
##           CP nsplit rel error    xerror       xstd
## 1 0.06839852      0 1.0000000 1.0080595 0.09215642
## 2 0.06726713      1 0.9316015 1.0920667 0.09543723
## 3 0.03462630      2 0.8643344 0.9969520 0.08632297
## 4 0.02508343      3 0.8297080 0.9291298 0.08571411
## 5 0.01995676      4 0.8046246 0.9357838 0.08560120
## 6 0.01817661      5 0.7846679 0.9337462 0.08087153
## 7 0.01203879      6 0.7664912 0.9092646 0.07982862
## 8 0.01000000      7 0.7544525 0.9407895 0.08399125
# Retreive optimal cp value based on cross-validated error
opt_index <- which.min(grade_model$cptable[, "xerror"])
cp_opt <- grade_model$cptable[opt_index, "CP"]

# Prune the model (to optimized cp value)
grade_model_opt <- rpart::prune(tree = grade_model, 
                         cp = cp_opt)
                          
# Plot the optimized model
rpart.plot::rpart.plot(x = grade_model_opt, yesno = 2, type = 0, extra = 0)

# Establish a list of possible values for minsplit and maxdepth
minsplit <- seq(1, 4, 1)
maxdepth <- seq(1, 6, 1)

# Create a data frame containing all combinations 
hyper_grid <- expand.grid(minsplit = minsplit, maxdepth = maxdepth)

# Check out the grid
head(hyper_grid)
##   minsplit maxdepth
## 1        1        1
## 2        2        1
## 3        3        1
## 4        4        1
## 5        1        2
## 6        2        2
# Print the number of grid combinations
nrow(hyper_grid)
## [1] 24
# Number of potential models in the grid
num_models <- nrow(hyper_grid)

# Create an empty list to store models
grade_models <- list()

# Write a loop over the rows of hyper_grid to train the grid of models
for (i in 1:num_models) {

    # Get minsplit, maxdepth values at row i
    minsplit <- hyper_grid$minsplit[i]
    maxdepth <- hyper_grid$maxdepth[i]

    # Train a model and store in the list
    grade_models[[i]] <- rpart::rpart(formula = final_grade ~ ., 
                               data = grade_train, 
                               method = "anova",
                               minsplit = minsplit,
                               maxdepth = maxdepth)
}


# Earlier in the chapter we split the dataset into three parts: training, validation and test

# A dataset that is not used in training is sometimes referred to as a "holdout" set
# A holdout set is used to estimate model performance and although both validation and test sets are considered to be holdout data, there is a key difference:

# Just like a test set, a validation set is used to evaluate the performance of a model
# The difference is that a validation set is specifically used to compare the performance of a group of models with the goal of choosing a "best model" from the group
# All the models in a group are evaluated on the same validation set and the model with the best performance is considered to the the winner.

# Once you have the best model, a final estimate of performance is computed on the test set.

# A test set should only ever be used to estimate model performance and should not be used in model selection
# Typically if you use a test set more than once, you are probably doing something wrong.

# Number of potential models in the grid
num_models <- length(grade_models)

# Create an empty vector to store RMSE values
rmse_values <- c()

# Write a loop over the models to compute validation RMSE
for (i in 1:num_models) {

    # Retreive the i^th model from the list
    model <- grade_models[[i]]
    
    # Generate predictions on grade_valid 
    pred <- predict(object = model,
                    newdata = grade_valid)
    
    # Compute validation RMSE and add to the 
    rmse_values[i] <- Metrics::rmse(actual = grade_valid$final_grade, 
                           predicted = pred)
}

# Identify the model with smallest validation set RMSE
best_model <- grade_models[[which.min(rmse_values)]]

# Print the model paramters of the best model
best_model$control
## $minsplit
## [1] 2
## 
## $minbucket
## [1] 1
## 
## $cp
## [1] 0.01
## 
## $maxcompete
## [1] 4
## 
## $maxsurrogate
## [1] 5
## 
## $usesurrogate
## [1] 2
## 
## $surrogatestyle
## [1] 0
## 
## $maxdepth
## [1] 1
## 
## $xval
## [1] 10
# Compute test set RMSE on best_model
pred <- predict(object = best_model, newdata = grade_test)
Metrics::rmse(actual = grade_test$final_grade, predicted = pred)
## [1] 2.124109

Chapter 3 - Bagged Trees

Introduction to bagged trees:

  • Bagging (bootstrap aggregation) combines many models in to an ensemble to reduce variance (overfitting)
    • Individual models are aggregated to an ensemble by way of averaging
    • Bootstrapping is smapling with replacement, meaning that some training data points will be represented multiple times
  • The bagging process occurs in several steps
    • The training set is split in to multiple smaller sets (typically size half the training data) using bootstrapping (sampling with replacement)
    • Trees are trained on each of the bootstrap samples - as many as have been specified (more is better as a rule of thumb)
    • Predictions are generated from each of the trees, then averaged together to create the ensemble prediction
  • Can use ipred::bagging(formula=, data=) to run bagged trees in R

Evaluating bagged tree performance:

  • Can use predict() to generate labels for the test set, using the same format as for a single tree
    • predict(object, newdata, type=“class”) # the “class” is so you get back classification labels as predictions
    • Can then use caret::confusionMatrix() to assess the performance of the bagged tree model
  • The ROC (receiver operator characteristic) is a commonly used curve to select a model
    • x-axis is false positive rate
    • y-axis is true positive rate
    • The AUC is the error under the curve and is a number between 0 and 1, with 0.5 being random chance and 1/0 being perfectly right/wrong
    • Can run Metrics::auc(actual, predicted) to auto-calculate the AUC

Using caret to cross-validate models:

  • This section will look at k-fold cross-validation, or the idea of k iterations of the model, each as a distinct subset with n/k points held out
    • Can then evaluate the model’s performance on the hold-out (the n/k held out at that fold), and average those k estimates for the parameter of interest
    • Generally, k-fold cross-validation will expand the processing time by a factor of k
  • The relevant portions of caret include caret::trainControl() and caret::train()
    • An example could be myTC <- trainControl(method=“cv”, number=5, classProbs=TRUE, summaryFunction=twoClassSummary) # 5-fold CV with AUC (twoClassSummary is a caret built-in for ROC)
    • And then train(formula, data=, method=“treebag”, metric=“ROC”, trControl=myTC) # to run bagging with the myTC control object from above

Example code includes:

# Let's start by training a bagged tree model
# You'll be using the bagging() function from the ipredpackage
# The number of bagged trees can be specified using the nbagg parameter, but here we will use the default (25)

# If we want to estimate the model's accuracy using the "out-of-bag" (OOB) samples, we can set the the coob parameter to TRUE
# The OOB samples are the training obsevations that were not selected into the bootstrapped sample (used in training)
# Since these observations were not used in training, we can use them instead to evaluate the accuracy of the model (done automatically inside the bagging() function)

# Bagging is a randomized model, so let's set a seed (123) for reproducibility
set.seed(123)

# Train a bagged model
credit_model <- ipred::bagging(formula = default ~ ., 
                        data = credit_train,
                        coob = TRUE)

# Print the model
print(credit_model)
## 
## Bagging classification trees with 25 bootstrap replications 
## 
## Call: bagging.data.frame(formula = default ~ ., data = credit_train, 
##     coob = TRUE)
## 
## Out-of-bag estimate of misclassification error:  0.2788
# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model,    
                            newdata = credit_test,  
                            type = "class")  # return classification labels

# Print the predicted classes
print(class_prediction)
##   [1] no  yes yes no  no  yes no  yes no  no  no  yes no  yes no  no  no 
##  [18] no  no  no  no  no  no  yes no  no  no  yes no  yes yes yes no  no 
##  [35] no  no  no  no  no  no  no  no  no  yes no  no  no  yes no  yes yes
##  [52] no  no  yes no  no  no  no  no  no  no  no  no  no  no  yes no  no 
##  [69] no  no  yes no  no  yes no  no  no  no  no  no  no  no  no  no  no 
##  [86] no  no  no  no  no  yes no  yes no  no  no  no  yes no  no  no  no 
## [103] no  no  yes no  no  no  no  no  no  no  no  no  no  no  no  no  no 
## [120] no  no  no  yes no  no  no  no  no  no  no  no  no  no  no  no  no 
## [137] no  no  no  no  yes no  yes no  yes no  no  no  no  no  no  no  yes
## [154] no  no  no  no  no  no  no  no  yes no  no  no  no  yes yes no  no 
## [171] no  no  yes yes no  no  no  no  no  no  no  yes no  no  no  no  yes
## [188] no  no  no  no  yes no  no  no  no  yes no  no  yes
## Levels: no yes
# Calculate the confusion matrix for the test set
caret::confusionMatrix(data = class_prediction,       
                reference = credit_test$default)  
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  126  36
##        yes  12  26
##                                           
##                Accuracy : 0.76            
##                  95% CI : (0.6947, 0.8174)
##     No Information Rate : 0.69            
##     P-Value [Acc > NIR] : 0.0178277       
##                                           
##                   Kappa : 0.3721          
##  Mcnemar's Test P-Value : 0.0009009       
##                                           
##             Sensitivity : 0.9130          
##             Specificity : 0.4194          
##          Pos Pred Value : 0.7778          
##          Neg Pred Value : 0.6842          
##              Prevalence : 0.6900          
##          Detection Rate : 0.6300          
##    Detection Prevalence : 0.8100          
##       Balanced Accuracy : 0.6662          
##                                           
##        'Positive' Class : no              
## 
# In binary classification problems, we can predict numeric values instead of class labels
# In fact, class labels are created only after you use the model to predict a raw, numeric, predicted value for a test point

# The predicted label is generated by applying a threshold to the predicted value, such that all tests points with predicted value greater than that threshold get a predicted label of "1" and, points below that threshold get a predicted label of "0".

# In this exercise, generate predicted values (rather than class labels) on the test set and evaluate performance based on AUC (Area Under the ROC Curve)
# The AUC is a common metric for evaluating the discriminatory ability of a binary classification model

# Generate predictions on the test set
pred <- predict(object = credit_model,
                newdata = credit_test,
                type = "prob")

# `pred` is a matrix
class(pred)
## [1] "matrix"
# Look at the pred format
head(pred)
##        no  yes
## [1,] 0.96 0.04
## [2,] 0.28 0.72
## [3,] 0.36 0.64
## [4,] 0.76 0.24
## [5,] 0.92 0.08
## [6,] 0.48 0.52
# Compute the AUC (`actual` must be a binary (or 1/0 numeric) vector)
(credit_ipred_model_test_auc <- Metrics::auc(actual = ifelse(credit_test$default == "yes", 1, 0), 
                                            predicted = pred[,"yes"]
                                            ))
## [1] 0.7809724
# Use caret::train() with the "treebag" method to train a model and evaluate the model using cross-validated AUC
# The caret package allows the user to easily cross-validate any model across any relevant performance metric
# In this case, we will use 5-fold cross validation and evaluate cross-validated AUC (Area Under the ROC Curve)

# The credit_train dataset is in your workspace. You will use this data frame as the training data

# Specify the training configuration
ctrl <- caret::trainControl(method = "cv",     # Cross-validation
                            number = 5,      # 5 folds
                            classProbs = TRUE,                  # For AUC
                            summaryFunction = caret::twoClassSummary)  # For AUC

# Cross validate the credit model using "treebag" method; 
# Track AUC (Area under the ROC curve)
set.seed(1)  # for reproducibility
credit_caret_model <- caret::train(default ~ ., data = credit_train, method = "treebag", 
                                   metric = "ROC", trControl = ctrl
                                   )
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:spatstat':
## 
##     panel.histogram
# Look at the model object
print(credit_caret_model)
## Bagged CART 
## 
## 800 samples
##  16 predictor
##   2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 641, 640, 640, 639, 640 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.7203687  0.8275126  0.4417553
# Inspect the contents of the model list 
names(credit_caret_model)
##  [1] "method"       "modelInfo"    "modelType"    "results"     
##  [5] "pred"         "bestTune"     "call"         "dots"        
##  [9] "metric"       "control"      "finalModel"   "preProcess"  
## [13] "trainingData" "resample"     "resampledCM"  "perfNames"   
## [17] "maximize"     "yLimits"      "times"        "levels"      
## [21] "terms"        "coefnames"    "contrasts"    "xlevels"
# Print the CV AUC
credit_caret_model$results[,"ROC"]
## [1] 0.7203687
# Generate predictions on the test set
pred <- predict(object = credit_caret_model, 
                newdata = credit_test,
                type = "prob")

bag_preds <- pred[, "yes"]

# Compute the AUC (`actual` must be a binary (or 1/0 numeric) vector)
(credit_caret_model_test_auc <- Metrics::auc(actual = ifelse(credit_test$default == "yes", 1, 0), 
                                            predicted = pred[,"yes"]
                                            ))
## [1] 0.7762389
# In this excercise, you will print test set AUC estimates that you computed in previous exercises
# These two methods use the same code underneath, so the estimates should be very similar.

# The credit_ipred_model_test_auc object stores the test set AUC from the model trained using the ipred::bagging() function
# The credit_caret_model_test_auc object stores the test set AUC from the model trained using the caret::train() function with method = "treebag"

# Lastly, we will print the 5-fold cross-validated estimate of AUC that is stored within the credit_caret_model object
# This number will be a more accurate estimate of the true model performance since we have averaged the performance over five models instead of just one

# On small datasets like this one, the difference between test set model performance estimates and cross-validated model performance estimates will tend to be more pronounced
# When using small data, it's recommended to use cross-validated estimates of performance because they are more stable

# Print ipred::bagging test set AUC estimate
print(credit_ipred_model_test_auc)
## [1] 0.7809724
# Print caret "treebag" test set AUC estimate
print(credit_caret_model_test_auc)
## [1] 0.7762389
# Compare to caret 5-fold cross-validated AUC
credit_caret_model$results[, "ROC"]
## [1] 0.7203687

Chapter 4 - Random Forests

Introduction to Random Forest:

  • Random forests tend to perform better than bagging, with a very nice combination of ease-of-use and power
    • Core concept is very similar to bagging - ensembles of bootstrap sampled trees
    • Extra randomness is added to the model though - each split of a tree can only consider a subset of variables as candidates for the split (feature bagging, random sub-space)
    • The extra randomness helps to make the trees more different from each other, improving the quality of the overall predictions
  • Can use randomForest::randomForest(formula, data=)
    • Adding trees (default is 500) will usually improve the answer

Understanding Random Forest model output:

  • The print() on a randomForest object will show many diagnostics - model formula/data, number of trees, mtry, OOB error estimate, confusion matrix
    • The absent samples for each tree of the randomForest can be used to calculate OOB error for that tree; each OOB estimate can be aggregated to mean, SE, etc.
  • The $err.rate call on a randomForest output will bring back the error matrix, with one row per tree - and, the last row has the overall estimate
  • The plot() on a randomForest object will show OOB error as a function of number of trees
    • After a certain point, including more trees does not enhance the quality of the predictions, but does increase computation time/complexity

OOB error vs. test-set error:

  • OOB error estimate advantages
    • Can evaluate models and estimate errors without any separate test data set
    • Calculated automatically by randomForest(), meaning there is no extra calculation needed
  • OOB error estimate disadvantages
    • No built-in capability for AUC, log-loss, and the like (only keeps track of the error rate; cannot calculate any other metrics after the fact)
    • Hard to compare randomForest to other models on metrics that may be more appropriate to other model types

Tuning a random-forest model:

  • Tuning the random forest model is generally easier than other models, since not much tuning is typically needed
  • Hyperparameters that are available include
    • ntree - number of trees (default 500)
    • mtry - number of candidate variables at each split (typically sqrt(nVar))
    • sampsize - number of samples to train on (defaults to 63.2% or the expected number of unique observations in a bootstrap sample)
    • nodesize - minimum size of terminal nodes
    • maxnodes - maximum number of terminal nodes (can be used to avoid over-fitting)
  • The mtry parameter is among the most important parameters in the randomForest process
    • Can use tuneRF(x=pred_train, y=train_response, ntreeTry=500) to automate the search
    • The tuneRF() process is merely one of many ways to tune a random forest
    • Can also use a grid-search process for either finer control on mtry or searched for mtry parameters in conjunction with other tuning variables

Example code includes:

# Here you will use the randomForest() function from the randomForest package to train a Random Forest classifier to predict loan default
# The credit_train and credit_test datasets (from Chapter 1 & 3) are already loaded in the workspace

# Train a Random Forest
set.seed(1)  # for reproducibility
credit_model <- randomForest::randomForest(formula = default ~ ., data = credit_train)
                             
# Print the model output                             
print(credit_model)
## 
## Call:
##  randomForest(formula = default ~ ., data = credit_train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 24.12%
## Confusion matrix:
##      no yes class.error
## no  516  46  0.08185053
## yes 147  91  0.61764706
# Grab OOB error matrix & take a look
err <- credit_model$err.rate
head(err)
##            OOB        no       yes
## [1,] 0.3414634 0.2657005 0.5375000
## [2,] 0.3311966 0.2462908 0.5496183
## [3,] 0.3232831 0.2476636 0.5147929
## [4,] 0.3164933 0.2180294 0.5561224
## [5,] 0.3197756 0.2095808 0.5801887
## [6,] 0.3176944 0.2115385 0.5619469
# Look at final OOB error rate (last row in err matrix)
oob_err <- err[nrow(err), "OOB"]
print(oob_err)
##     OOB 
## 0.24125
# Plot the model trained in the previous exercise
plot(credit_model)

# Add a legend since it doesn't have one by default
legend(x = "right", 
       legend = colnames(err),
       fill = 1:ncol(err))

# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model,   # model object 
                            newdata = credit_test,  # test dataset
                            type = "class") # return classification labels
                            
# Calculate the confusion matrix for the test set
cm <- caret::confusionMatrix(data = class_prediction,       # predicted classes
                      reference = credit_test$default)  # actual classes
print(cm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  131  40
##        yes   7  22
##                                        
##                Accuracy : 0.765        
##                  95% CI : (0.7, 0.8219)
##     No Information Rate : 0.69         
##     P-Value [Acc > NIR] : 0.01186      
##                                        
##                   Kappa : 0.3563       
##  Mcnemar's Test P-Value : 3.046e-06    
##                                        
##             Sensitivity : 0.9493       
##             Specificity : 0.3548       
##          Pos Pred Value : 0.7661       
##          Neg Pred Value : 0.7586       
##              Prevalence : 0.6900       
##          Detection Rate : 0.6550       
##    Detection Prevalence : 0.8550       
##       Balanced Accuracy : 0.6521       
##                                        
##        'Positive' Class : no           
## 
# Compare test set accuracy to OOB accuracy
paste0("Test Accuracy: ", cm$overall[1])
## [1] "Test Accuracy: 0.765"
paste0("OOB Accuracy: ", 1 - oob_err)
## [1] "OOB Accuracy: 0.75875"
# Generate predictions on the test set
pred <- predict(object = credit_model,
            newdata = credit_test,
            type = "prob")

# `pred` is a matrix
class(pred)
## [1] "matrix" "votes"
# Look at the pred format
head(pred)
##       no   yes
## 3  0.894 0.106
## 10 0.294 0.706
## 11 0.414 0.586
## 14 0.772 0.228
## 27 0.760 0.240
## 28 0.618 0.382
# Compute the AUC (`actual` must be a binary 1/0 numeric vector)
Metrics::auc(actual = ifelse(credit_test$default == "yes", 1, 0), 
    predicted = pred[,"yes"])                    
## [1] 0.8037634
rf_preds <- pred[, "yes"]

# In this exercise, you will use the randomForest::tuneRF() to tune mtry (by training several models)
# This function is a specific utility to tune the mtry parameter based on OOB error, which is helpful when you want a quick & easy way to tune your model
# A more generic way of any Random Forest parameter will be presented in the following exercise

# Use the tuneRF() function in place of the randomForest() function to train a series of models with different mtry values and examine the the results
# Note that (unfortunately) the tuneRF() interface does not support the typical formula input that we've been using
# but instead uses two arguments, x (matrix or data frame of predictor variables) and y (response vector; must be a factor for classification)
# The tuneRF() function has an argument, ntreeTry that defaults to 50 trees. Set nTreeTry = 500 to train a random forest model of the same size as you previously did
# After tuning the forest, this function will also plot model performance (OOB error) as a function of the mtry values that were evaluated
# Keep in mind that if we want to evaluate the model based on AUC instead of error (accuracy), then this is not the best way to tune a model, as the selection only considers (OOB) error

# Execute the tuning process
set.seed(1)              
res <- randomForest::tuneRF(x = subset(credit_train, select = -default), 
                            y = credit_train$default, 
                            ntreeTry = 500
                            )
## mtry = 4  OOB error = 24.12% 
## Searching left ...
## mtry = 2     OOB error = 24.62% 
## -0.02072539 0.05 
## Searching right ...
## mtry = 8     OOB error = 25.12% 
## -0.04145078 0.05

# Look at results
print(res)
##       mtry OOBError
## 2.OOB    2  0.24625
## 4.OOB    4  0.24125
## 8.OOB    8  0.25125
# Find the mtry value that minimizes OOB Error
mtry_opt <- res[,"mtry"][which.min(res[,"OOBError"])]
print(mtry_opt)
## 4.OOB 
##     4
# If you just want to return the best RF model (rather than results)
# you can set `doBest = TRUE` in `tuneRF()` to return the best RF model
# instead of a set performance matrix.


# In Chapter 2, we created a manual grid of hyperparameters using the expand.grid() function and wrote code that trained and evaluated the models of the grid in a loop
# In this exercise, you will create a grid of mtry, nodesize and sampsize values
# In this example, we will identify the "best model" based on OOB error
# The best model is defined as the model from our grid which minimizes OOB error

# Keep in mind that there are other ways to select a best model from a grid, such as choosing the best model based on validation AUC
# However, for this exercise, we will use the built-in OOB error calculations instead of using a separate validation set

# Establish a list of possible values for mtry, nodesize and sampsize
mtry <- seq(4, ncol(credit_train) * 0.8, 2)
nodesize <- seq(3, 8, 2)
sampsize <- nrow(credit_train) * c(0.7, 0.8)

# Create a data frame containing all combinations 
hyper_grid <- expand.grid(mtry = mtry, nodesize = nodesize, sampsize = sampsize)

# Create an empty vector to store OOB error values
oob_err <- c()

# Write a loop over the rows of hyper_grid to train the grid of models
for (i in 1:nrow(hyper_grid)) {

    # Train a Random Forest model
    model <- randomForest::randomForest(formula = default ~ ., 
                          data = credit_train,
                          mtry = hyper_grid$mtry[i],
                          nodesize = hyper_grid$nodesize[i],
                          sampsize = hyper_grid$sampsize[i])
                          
    # Store OOB error for the model                      
    oob_err[i] <- model$err.rate[nrow(model$err.rate), "OOB"]
}

# Identify optimal set of hyperparmeters based on OOB error
opt_i <- which.min(oob_err)
print(hyper_grid[opt_i,])
##   mtry nodesize sampsize
## 9   10        5      560

Chapter 5 - Boosted Trees

Introduction to boosting - by way of the gradient boosting machine (GBM):

  • Adaboost has been around for a long time, and is widely used
    • Train tree with all observations having an equal weight
    • Change the weights of the observations so that greater weights go on to the points that were harder to predict
    • The new model is the first two trees combined
    • Rinse and repeat as needed, with final predictions being a weighted aggregate of the various trees
  • GBM is newer and has taken significant share from Adaboost
    • Similar process of identifyinf shortcomings and working to overcome them
    • GBM identified shortcoming using gradients rather than high-weight data points
    • Popular due to excellent performance - often out-performs deep learning and allows for a user-defined cost function
    • Caution is GBM can pretty easily over-fit, meaning that some early stopping parameters are needed
  • The formula interface is gbm(formula, data=, distribution=“bernoulli”, n.trees=5000)
    • Binary classification goes to “bernoulli”
    • The default number of trees (n.trees=100) is 100, which is a reasonable starting point

Understanding GBM model output:

  • Printing the GBM model object will show the number of trees used, and the number of variables with no influence on the model
  • Tree-based models have a built-in mechanism for assessing variable importance
    • The summary() call produces both a variable importance table and an associated plot
  • Can create predictions using predict(model, type=“response”, n.trees=)
    • There is no default value for n.trees; can use the same value as in training, though this may be sub-optimal
    • The predict() function is an alias, in this case to predict.gbm

GBM hyper-parameters - especially important due to risk of over-fitting:

  • Tuning the number of iterations in an iterative algorithm such as GBM is referred to as “early stopping”
  • Important hyper-parameters in GBM include
    • n.trees - number of trees
    • bag.fraction - proportion of observations to be sampled in each tree (sampsize is the equivalent in randomForest)
    • n.minobsinnode - minimum observations in terminal nodes
    • interaction.depth - maximum nodes and/or splits
    • shrinkage - learning rate, impact of each additional tree (higher values penalize additional iterations) - slower learning rates are typically better
  • Early stopping is the idea of ending the iterative modeling based on feedback from a holdout data set
    • The ideal time to stop is once the validation error has stabilized, and before the validation error has begun increasing due to over-fitting
    • The GBM package has a built-in process for optimizing early stopping - gbm.perf(model, method=“cv”) # get optimal ntree based on cv error (method=“OOB” will get optimal ntree based on OOB error)

Model comparison via ROC and AUC:

  • Comparisons across all the types of models - trees, bagged trees, boosted trees, GBM
  • Calculate AUC for the test data for each of the model types

Example code includes:

# Here you will use the gbm() function to train a GBM classifier to predict loan default
# You will train a 10,000-tree GBM on the credit_train dataset, which is pre-loaded into your workspace

# Using such a large number of trees (10,000) is probably not optimal for a GBM model, but we will build more trees than we need and then select the optimal number of trees based on early performance-based stopping
# The best GBM model will likely contain fewer trees than we started with

# For binary classification, gbm() requires the response to be encoded as 0/1 (numeric), so we will have to convert from a "no/yes" factor to a 0/1 numeric response column

# Also, the the gbm() function requires the user to specify a distribution argument
# For a binary classification problem, you should set distribution = "bernoulli". The Bernoulli distribution models a binary response

# Convert "yes" to 1, "no" to 0
credit_train$default <- ifelse(credit_train$default == "yes", 1, 0)

# Train a 10000-tree GBM model
set.seed(1)
credit_model <- gbm::gbm(formula = default ~ ., 
                    distribution = "bernoulli", 
                    data = credit_train,
                    n.trees = 10000)
                    
# Print the model object                    
print(credit_model)
## gbm::gbm(formula = default ~ ., distribution = "bernoulli", data = credit_train, 
##     n.trees = 10000)
## A gradient boosted model with bernoulli loss function.
## 10000 iterations were performed.
## There were 16 predictors of which 16 had non-zero influence.
# summary() prints variable importance
summary(credit_model)

##                                       var     rel.inf
## checking_balance         checking_balance 33.49502510
## amount                             amount 11.62938098
## months_loan_duration months_loan_duration 11.17113439
## credit_history             credit_history 11.15698321
## savings_balance           savings_balance  6.44293358
## employment_duration   employment_duration  6.06266137
## age                                   age  5.73175696
## percent_of_income       percent_of_income  3.74219743
## other_credit                 other_credit  3.56695375
## purpose                           purpose  3.38820798
## housing                           housing  1.55169398
## years_at_residence     years_at_residence  1.35255308
## job                                   job  0.47631930
## phone                               phone  0.09142691
## existing_loans_count existing_loans_count  0.08924265
## dependents                     dependents  0.05152933
# The gbm package uses a predict() function to generate predictions from a model, similar to many other machine learning packages in R
# When you see a function like predict() that works on many different types of input (a GBM model, a RF model, a GLM model, etc), that indicates that predict() is an "alias" for a GBM-specific version of that function
# The GBM specific version of that function is predict.gbm(), but for convenience sake, we can just use predict() (either works)

# One thing that's particular to the predict.gbm() however, is that you need to specify the number of trees used in the prediction
# There is no default, so you have to specify this manually
# For now, we can use the same number of trees that we specified when training the model, which is 10,000 (though this may not be the optimal number to use)

# Another argument that you can specify is type, which is only relevant to Bernoulli and Poisson distributed outcomes
# When using Bernoulli loss, the returned value is on the log odds scale by default and for Poisson, it's on the log scale
# If instead you specify type = "response", then gbm converts the predicted values back to the same scale as the outcome
# This will convert the predicted values into probabilities for Bernoulli and expected counts for Poisson

# Since we converted the training response col, let's also convert the test response col
credit_test$default <- ifelse(credit_test$default == "yes", 1, 0)

# Generate predictions on the test set
preds1 <- predict(object = credit_model, 
                  newdata = credit_test,
                  n.trees = 10000)

# Generate predictions on the test set (scale to response)
preds2 <- predict(object = credit_model, 
                  newdata = credit_test,
                  n.trees = 10000,
                  type = "response")

# Compare the range of the two sets of predictions
range(preds1)
## [1] -3.210354  2.088293
range(preds2)
## [1] 0.03877796 0.88976007
# Compute test set AUC of the GBM model for the two sets of predictions
# We will notice that they are the same value
# That's because AUC is a rank-based metric, so changing the actual values does not change the value of the AUC
# However, if we were to use a scale-aware metric like RMSE to evaluate performance, we would want to make sure we converted the predictions back to the original scale of the response

# Generate the test set AUCs using the two sets of preditions & compare
Metrics::auc(actual = credit_test$default, predicted = preds1)  #default
## [1] 0.7875175
Metrics::auc(actual = credit_test$default, predicted = preds2)  #rescaled
## [1] 0.7875175
# Use the gbm.perf() function to estimate the optimal number of boosting iterations (aka n.trees) for a GBM model object using both OOB and CV error
# When you set out to train a large number of trees in a GBM (such as 10,000) and you use a validation method to determine an earlier (smaller) number of trees, then that's called "early stopping"
# The term "early stopping" is not unique to GBMs, but can describe auto-tuning the number of iterations in an iterative learning algorithm

# Optimal ntree estimate based on OOB
ntree_opt_oob <- gbm::gbm.perf(object = credit_model, 
                          method = "OOB", 
                          oobag.curve = TRUE)
## Warning in gbm::gbm.perf(object = credit_model, method = "OOB", oobag.curve
## = TRUE): OOB generally underestimates the optimal number of iterations
## although predictive performance is reasonably competitive. Using cv.folds>0
## when calling gbm usually results in improved predictive performance.

# Train a CV GBM model
set.seed(1)
credit_model_cv <- gbm::gbm(formula = default ~ ., 
                       distribution = "bernoulli", 
                       data = credit_train,
                       n.trees = 10000,
                       cv.folds = 2)

# Optimal ntree estimate based on CV
ntree_opt_cv <- gbm::gbm.perf(object = credit_model_cv, 
                         method = "cv")

# Compare the estimates                         
print(paste0("Optimal n.trees (OOB Estimate): ", ntree_opt_oob))                         
## [1] "Optimal n.trees (OOB Estimate): 3233"
print(paste0("Optimal n.trees (CV Estimate): ", ntree_opt_cv))
## [1] "Optimal n.trees (CV Estimate): 7889"
# In the previous exercise, we used OOB error and cross-validated error to estimate the optimal number of trees in the GBM
# These are two different ways to estimate the optimal number of trees, so in this exercise we will compare the performance of the models on a test set
# We can use the same model object to make both of these estimates since the predict.gbm() function allows you to use any subset of the total number of trees (in our case, the total number is 10,000).

# Generate predictions on the test set using ntree_opt_oob number of trees
preds1 <- predict(object = credit_model, 
                  newdata = credit_test,
                  n.trees = ntree_opt_oob)
                  
# Generate predictions on the test set using ntree_opt_cv number of trees
preds2 <- predict(object = credit_model, 
                  newdata = credit_test,
                  n.trees = ntree_opt_cv) 

gbm_preds <- preds2

# Generate the test set AUCs using the two sets of preditions & compare
auc1 <- Metrics::auc(actual = credit_test$default, predicted = preds1)  #OOB
auc2 <- Metrics::auc(actual = credit_test$default, predicted = preds2)  #CV 

# Compare AUC 
print(paste0("Test set AUC (OOB): ", auc1))                         
## [1] "Test set AUC (OOB): 0.777816736792894"
print(paste0("Test set AUC (CV): ", auc2))
## [1] "Test set AUC (CV): 0.785530621785881"
# In this final exercise, we will perform a model comparison across all types of models that we've learned about so far: Decision Trees, Bagged Trees, Random Forest and Gradient Boosting Machine (GBM)
# The models were all trained on the same training set, credit_train, and predictions were made for the credit_test dataset

# We have pre-loaded four sets of test set predictions, generated using the models we trained in previous chapters (one for each model type)
# The numbers stored in the prediction vectors are the raw predicted values themselves -- not the predicted class labels
# Using the raw predicted values, we can calculate test set AUC for each model and compare the results

# Loaded in your workspace are four numeric vectors: dt_preds , bag_preds , rf_preds , gbm_preds
# These predictions were made on credit_test, which is also loaded into the workspace

# Generate the test set AUCs using the two sets of predictions & compare
actual <- as.numeric(credit_test$default)
dt_auc <- Metrics::auc(actual = actual, predicted = dt_preds)
bag_auc <- Metrics::auc(actual = actual, predicted = bag_preds)
rf_auc <- Metrics::auc(actual = actual, predicted = rf_preds)
gbm_auc <- Metrics::auc(actual = actual, predicted = gbm_preds)

# Print results
sprintf("Decision Tree Test AUC: %.3f", dt_auc)
## [1] "Decision Tree Test AUC: 0.650"
sprintf("Bagged Trees Test AUC: %.3f", bag_auc)
## [1] "Bagged Trees Test AUC: 0.776"
sprintf("Random Forest Test AUC: %.3f", rf_auc)
## [1] "Random Forest Test AUC: 0.804"
sprintf("GBM Test AUC: %.3f", gbm_auc)
## [1] "GBM Test AUC: 0.786"
# We conclude this course by plotting the ROC curves for all the models (one from each chapter) on the same graph
# The ROCR package provides the prediction() and performance() functions which generate the data required for plotting the ROC curve, given a set of predictions and actual (true) values

# The more "up and to the left" the ROC curve of a model is, the better the model
# The AUC performance metric is literally the "Area Under the ROC Curve", so the greater the area under this curve, the higher the AUC, and the better-performing the model is

# List of predictions
preds_list <- list(dt_preds, bag_preds, rf_preds, gbm_preds)

# List of actual values (same for all)
m <- length(preds_list)
actuals_list <- rep(list(credit_test$default), m)

# Plot the ROC curves
pred <- ROCR::prediction(preds_list, actuals_list)
rocs <- ROCR::performance(pred, "tpr", "fpr")
ROCR::plot(rocs, col = as.list(1:m), main = "Test Set ROC Curves")
legend(x = "bottomright", 
       legend = c("Decision Tree", "Bagged Trees", "Random Forest", "GBM"),
       fill = 1:m)


Supervised Learning in R: Classification

Chapter 1 - k-Nearest Neighbors (kNN)

Classification with Nearest Neighbors:

  • Brett Lantz, “Machine Learning with R”
  • Supervised learning trains the machine from previous examples; classification is predicting a factor variable rather than a continuous variable
  • Example of automated driving - need to identify and classify objects captured on the camera
    • Car records your actions while it sees various signs, and then classifies like signs in to appropriate segments and actions
  • Nearest neighbor algorithms measure the “distance” between signs based on coordinates in a feature space
    • Color as a 3-D RGB would be an example; “close” would mean “similar” on the RGB scale
    • Euclidean distance is a common metric for distance, and is calculated automatically by R
  • The knn algorithms can be run in R using:
    • knn(trainData, testData, trainLabels)

What about the “k” in kNN?

  • The k is a variable for the number of neighbors ro consider - “size of the neighborhood”
  • The value of k can significantly impact the performance of the classifier
    • Sometimes, the very closest neighbor is not a good match; there are benefits to aggregating (voting in classification)
    • In the case of a tie, the winner of the vote is typically determined at random
    • If the neighborhood becomes too big (large k), the classifier loses out on signal
  • There is a bias-variance tradeoff for k
    • Large k may produce bias, while small k may produce variance (over-fits)
    • A starting point for k can be the square root of the number of observations in the dataset
    • Assessing performance on the test dataset is a great way to tune the k-parameter

Data preparation for kNN:

  • Since nearest neighbors models use distance, the data need to be properly pre-processed
    • For example, conversion of colors (factors) to RGB (three numbers)
    • Alternately, converting shapes to dummy (1/0) variables
    • All features should be measured with the same range of data - scale and normalize prior to running models

Example code includes:

# The dataset signs is loaded in your workspace along with the dataframe next_sign, which holds the observation you want to classify.

signRaw <- readr::read_csv("./RInputFiles/knn_traffic_signs.csv")
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   sample = col_character(),
##   sign_type = col_character()
## )
## See spec(...) for full column specifications.
signs <- signRaw %>% filter(sample=="train") %>% select(-id, -sample)
test_signs <- signRaw %>% filter(sample=="test") %>% select(-id, -sample)
next_sign <- signRaw %>% filter(sample=="example") %>% select(-id, -sample, -sign_type)

# Load the 'class' package
# library(class)

# Create a vector of labels
sign_types <- signs$sign_type

# Classify the next sign observed
class::knn(train = signs[-1], test = next_sign, cl = sign_types)
## [1] stop
## Levels: pedestrian speed stop
# To better understand how the knn() function was able to classify the stop sign, it may help to examine the training dataset it used.
# Each previously observed street sign was divided into a 4x4 grid, and the red, green, and blue level for each of the 16 center pixels is recorded as illustrated here.
# The result is a dataset that records the sign_type as well as 16 x 3 = 48 color properties of each sign.

# Examine the structure of the signs dataset
str(signs, give.attr=FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    146 obs. of  49 variables:
##  $ sign_type: chr  "pedestrian" "pedestrian" "pedestrian" "pedestrian" ...
##  $ r1       : int  155 142 57 22 169 75 136 149 13 123 ...
##  $ g1       : int  228 217 54 35 179 67 149 225 34 124 ...
##  $ b1       : int  251 242 50 41 170 60 157 241 28 107 ...
##  $ r2       : int  135 166 187 171 231 131 200 34 5 83 ...
##  $ g2       : int  188 204 201 178 254 89 203 45 21 61 ...
##  $ b2       : int  101 44 68 26 27 53 107 1 11 26 ...
##  $ r3       : int  156 142 51 19 97 214 150 155 123 116 ...
##  $ g3       : int  227 217 51 27 107 144 167 226 154 124 ...
##  $ b3       : int  245 242 45 29 99 75 134 238 140 115 ...
##  $ r4       : int  145 147 59 19 123 156 171 147 21 67 ...
##  $ g4       : int  211 219 62 27 147 169 218 222 46 67 ...
##  $ b4       : int  228 242 65 29 152 190 252 242 41 52 ...
##  $ r5       : int  166 164 156 42 221 67 171 170 36 70 ...
##  $ g5       : int  233 228 171 37 236 50 158 191 60 53 ...
##  $ b5       : int  245 229 50 3 117 36 108 113 26 26 ...
##  $ r6       : int  212 84 254 217 205 37 157 26 75 26 ...
##  $ g6       : int  254 116 255 228 225 36 186 37 108 26 ...
##  $ b6       : int  52 17 36 19 80 42 11 12 44 21 ...
##  $ r7       : int  212 217 211 221 235 44 26 34 13 52 ...
##  $ g7       : int  254 254 226 235 254 42 35 45 27 45 ...
##  $ b7       : int  11 26 70 20 60 44 10 19 25 27 ...
##  $ r8       : int  188 155 78 181 90 192 180 221 133 117 ...
##  $ g8       : int  229 203 73 183 110 131 211 249 163 109 ...
##  $ b8       : int  117 128 64 73 9 73 236 184 126 83 ...
##  $ r9       : int  170 213 220 237 216 123 129 226 83 110 ...
##  $ g9       : int  216 253 234 234 236 74 109 246 125 74 ...
##  $ b9       : int  120 51 59 44 66 22 73 59 19 12 ...
##  $ r10      : int  211 217 254 251 229 36 161 30 13 98 ...
##  $ g10      : int  254 255 255 254 255 34 190 40 27 70 ...
##  $ b10      : int  3 21 51 2 12 37 10 34 25 26 ...
##  $ r11      : int  212 217 253 235 235 44 161 34 9 20 ...
##  $ g11      : int  254 255 255 243 254 42 190 44 23 21 ...
##  $ b11      : int  19 21 44 12 60 44 6 35 18 20 ...
##  $ r12      : int  172 158 66 19 163 197 187 241 85 113 ...
##  $ g12      : int  235 225 68 27 168 114 215 255 128 76 ...
##  $ b12      : int  244 237 68 29 152 21 236 54 21 14 ...
##  $ r13      : int  172 164 69 20 124 171 141 205 83 106 ...
##  $ g13      : int  235 227 65 29 117 102 142 229 125 69 ...
##  $ b13      : int  244 237 59 34 91 26 140 46 19 9 ...
##  $ r14      : int  172 182 76 64 188 197 189 226 85 102 ...
##  $ g14      : int  228 228 84 61 205 114 171 246 128 67 ...
##  $ b14      : int  235 143 22 4 78 21 140 59 21 6 ...
##  $ r15      : int  177 171 82 211 125 123 214 235 85 106 ...
##  $ g15      : int  235 228 93 222 147 74 221 252 128 69 ...
##  $ b15      : int  244 196 17 78 20 22 201 67 21 9 ...
##  $ r16      : int  22 164 58 19 160 180 188 237 83 43 ...
##  $ g16      : int  52 227 60 27 183 107 211 254 125 29 ...
##  $ b16      : int  53 237 60 29 187 26 227 53 19 11 ...
# Count the number of signs of each type
table(signs$sign_type)
## 
## pedestrian      speed       stop 
##         46         49         51
# Check r10's average red level by sign type
aggregate(r10 ~ sign_type, data = signs, mean)
##    sign_type       r10
## 1 pedestrian 113.71739
## 2      speed  80.63265
## 3       stop 132.39216
# Now that the autonomous vehicle has successfully stopped on its own, your team feels confident allowing the car to continue the test course
# The test course includes 59 additional road signs divided into three types:
# At the conclusion of the trial, you are asked to measure the car's overall performance at recognizing these signs

# Use kNN to identify the test road signs
sign_types <- signs$sign_type
signs_pred <- class::knn(train = signs[-1], test = test_signs[-1], cl = sign_types)

# Create a confusion matrix of the actual versus predicted values
signs_actual <- test_signs$sign_type
table(signs_pred, signs_actual)
##             signs_actual
## signs_pred   pedestrian speed stop
##   pedestrian         19     2    0
##   speed               0    17    0
##   stop                0     2   19
# Compute the accuracy
mean(signs_pred == signs_actual)
## [1] 0.9322034
# By default, the knn() function in the class package uses only the single nearest neighbor
# Setting a k parameter allows the algorithm to consider additional nearby neighbors
# This enlarges the collection of neighbors which will vote on the predicted class
# Compare k values of 1, 7, and 15 to examine the impact on traffic sign classification accuracy

# Compute the accuracy of the baseline model (default k = 1)
k_1 <- class::knn(train = signs[, -1], test = test_signs[, -1], cl = signs$sign_type)
mean(k_1 == signs_actual)
## [1] 0.9322034
# Modify the above to set k = 7
k_7 <- class::knn(train = signs[, -1], test = test_signs[, -1], cl = signs$sign_type, k=7)
mean(k_7 == signs_actual)
## [1] 0.9661017
# Set k = 15 and compare to the above
k_15 <- class::knn(train = signs[, -1], test = test_signs[, -1], cl = signs$sign_type, k=15)
mean(k_15 == signs_actual)
## [1] 0.8813559
# When multiple nearest neighbors hold a vote, it can sometimes be useful to examine whether the voters were unanimous or widely separated
# For example, knowing more about the voters' confidence in the classification could allow an autonomous vehicle to use caution in the case there is any chance at all that a stop sign is ahead
# In this exercise, you will learn how to obtain the voting results from the knn() function

# The class package has already been loaded in your workspace along with the dataset signs
# Build a kNN model with the prob = TRUE parameter to compute the vote proportions. Set k = 7

# Use the prob parameter to get the proportion of votes for the winning class
sign_pred <- class::knn(train = signs[, -1], test = test_signs[, -1], cl = signs$sign_type, k=7, prob=TRUE)

# Get the "prob" attribute from the predicted classes
sign_prob <- attr(sign_pred, "prob")

# Examine the first several predictions
head(sign_pred)
## [1] pedestrian pedestrian pedestrian stop       pedestrian pedestrian
## Levels: pedestrian speed stop
# Examine the proportion of votes for the winning class
head(sign_prob)
## [1] 0.5714286 0.5714286 0.8571429 0.5714286 0.8571429 0.5714286

Chapter 2 - Naïve Bayes

Understanding Bayesian Models:

  • Bayesian methods apply the methods of Thomas Bayes to use data to estimate probabilities
  • Based on the data, phones can understand locations, times of days, etc., as well as their joint probabilities
    • The joint probability of two events is denoted as P(A and B)
    • When one event is predictive of another, the events are considered to be “dependent”
    • P(A | B) = P(A and B) | P(B)
  • The Naïve Bayes algorithm makes predictions based on these joint probabilities
    • naivebayes::naïve_bayes(y ~ x, data=)
    • predict(myNB, newdata=)

Understanding NB’s “naivety”:

  • With multiple predictors, the Venn diagrams for overlap can become very messy (and confusing)
    • It also becomes more complex for the computer, and is computattionally inefficient
  • The “naïve” simplification is to assume that events are independent; therefore, the full Venn diagram is not needed
    • Instead, just multiply the point probabilities together
    • Research has shown that while probabilities are rarely independent, multiplying them together as though they are independent tends to perform well
  • There are also infrequency problems - rare events that get classified as zero can trump all the other probabilities
    • A typical work-around is to add +1 to every outcome (Laplace correction), meaning there is no join probability of zero and thus no full vert power

Applying Naïve Bayes to other problems:

  • Naïve Bayes tends to works well when information needs to be considered simultaneously and evaluated as a whole
    • Frequently used for text data (e.g., classifying e-mail as spam)
  • Under the hood, Naïve Bayes is building contingency tables for predictive value of each of the variables
    • As a consequence, continuous variables tend not to perform well for Naïve Bayes
    • Pre-processing the data (e.g., binning, bag of words) can convert the continuous variables to categorical variables

Example code includes:

locationsRaw <- read.csv("./RInputFiles/locations.csv")
str(locationsRaw, give.attr=FALSE)
## 'data.frame':    2184 obs. of  7 variables:
##  $ month   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day     : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ weekday : Factor w/ 7 levels "friday","monday",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ daytype : Factor w/ 2 levels "weekday","weekend": 1 1 1 1 1 1 1 1 1 1 ...
##  $ hour    : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ hourtype: Factor w/ 4 levels "afternoon","evening",..: 4 4 4 4 4 4 3 3 3 3 ...
##  $ location: Factor w/ 7 levels "appointment",..: 3 3 3 3 3 3 3 3 3 4 ...
where9am <- locationsRaw %>% filter(hour==9) %>% select(daytype, location) %>% droplevels()
str(where9am, give.attr=FALSE)
## 'data.frame':    91 obs. of  2 variables:
##  $ daytype : Factor w/ 2 levels "weekday","weekend": 1 1 1 2 2 1 1 1 1 1 ...
##  $ location: Factor w/ 4 levels "appointment",..: 4 4 4 3 3 2 3 1 4 4 ...
# The where9am data frame contains 91 days (thirteen weeks) worth of data in which Brett recorded his location at 9am each day as well as whether the daytype was a weekend or weekday
# Compute P(A) 
p_A <- nrow(subset(where9am, location == "office")) / nrow(where9am)

# Compute P(B)
p_B <- nrow(subset(where9am, daytype == "weekday")) / nrow(where9am)

# Compute the observed P(A and B)
p_AB <- nrow(subset(where9am, daytype == "weekday" & location == "office")) / nrow(where9am)

# Compute P(A | B)
(p_A_given_B <- p_AB / p_B)
## [1] 0.6
# Build the location prediction model
locmodel <- naivebayes::naive_bayes(location ~ daytype, data = where9am)

# Predict Thursday's 9am location
thursday9am <- data.frame(daytype=factor("weekday", levels=c("weekday", "weekend")))
predict(locmodel, thursday9am)
## [1] office
## Levels: appointment campus home office
# Predict Saturdays's 9am location
saturday9am <- data.frame(daytype=factor("weekend", levels=c("weekday", "weekend")))
predict(locmodel, saturday9am)
## [1] home
## Levels: appointment campus home office
# The 'naivebayes' package is loaded into the workspace
# and the Naive Bayes 'locmodel' has been built

# Examine the location prediction model
locmodel
## ===================== Naive Bayes ===================== 
## Call: 
## naive_bayes.formula(formula = location ~ daytype, data = where9am)
## 
## A priori probabilities: 
## 
## appointment      campus        home      office 
##  0.01098901  0.10989011  0.45054945  0.42857143 
## 
## Tables: 
##          
## daytype   appointment    campus      home    office
##   weekday   1.0000000 1.0000000 0.3658537 1.0000000
##   weekend   0.0000000 0.0000000 0.6341463 0.0000000
# Obtain the predicted probabilities for Thursday at 9am
predict(locmodel, thursday9am , type = "prob")
##      appointment    campus      home office
## [1,]  0.01538462 0.1538462 0.2307692    0.6
# Obtain the predicted probabilities for Saturday at 9am
predict(locmodel, saturday9am , type = "prob")
##      appointment campus home office
## [1,]           0      0    1      0
# The locations dataset records Brett's location every hour for 13 weeks
# Each hour, the tracking information includes the daytype (weekend or weekday) as well as the hourtype (morning, afternoon, evening, or night)

# Using this data, build a more sophisticated model to see how Brett's predicted location not only varies by the day of week but also by the time of day

# The dataset locations is already loaded in your workspace

# The 'naivebayes' package is loaded into the workspace already

# Build a NB model of location
locmodel <- naivebayes::naive_bayes(location ~ daytype + hourtype, data=locationsRaw)

# Predict Brett's location on a weekday afternoon
weekday_afternoon <- data.frame(daytype=factor("weekday", levels=c("weekday", "weekend")), 
                                hourtype=factor("afternoon", levels=c("afternoon", "evening", "morning", "night")), 
                                location=factor("office", levels=c("appointment", "campus", "home", "office", "restaurant", "store", "theater"))
                                )
predict(locmodel, weekday_afternoon)
## [1] office
## Levels: appointment campus home office restaurant store theater
# Predict Brett's location on a weekday evening
weekday_evening <- data.frame(daytype=factor("weekday", levels=c("weekday", "weekend")), 
                                hourtype=factor("evening", levels=c("afternoon", "evening", "morning", "night")), 
                                location=factor("home", levels=c("appointment", "campus", "home", "office", "restaurant", "store", "theater"))
                                )
predict(locmodel, weekday_evening)
## [1] home
## Levels: appointment campus home office restaurant store theater
# While Brett was tracking his location over 13 weeks, he never went into the office during the weekend
# Consequently, the joint probability of P(office and weekend) = 0

# Explore how this impacts the predicted probability that Brett may go to work on the weekend in the future
# Additionally, you can see how using the Laplace correction will allow a small chance for these types of unforeseen circumstances

# The 'naivebayes' package is loaded into the workspace already
# The Naive Bayes location model (locmodel) has already been built

# Observe the predicted probabilities for a weekend afternoon
weekend_afternoon <- data.frame(daytype=factor("weekend", levels=c("weekday", "weekend")), 
                                hourtype=factor("afternoon", levels=c("afternoon", "evening", "morning", "night")), 
                                location=factor("home", levels=c("appointment", "campus", "home", "office", "restaurant", "store", "theater"))
                                )
predict(locmodel, weekend_afternoon, type="prob")
##      appointment campus      home office restaurant      store theater
## [1,]  0.02472535      0 0.8472217      0  0.1115693 0.01648357       0
# Build a new model using the Laplace correction
locmodel2 <- naivebayes::naive_bayes(location ~ daytype + hourtype, data=locationsRaw, laplace=1)

# Observe the new predicted probabilities for a weekend afternoon
predict(locmodel2, weekend_afternoon, type="prob")
##      appointment      campus      home      office restaurant      store
## [1,]  0.01107985 0.005752078 0.8527053 0.008023444  0.1032598 0.01608175
##          theater
## [1,] 0.003097769

Chapter 3 - Logistic Regression

Making binary predictions with regression:

  • Logistic regression techniques for predicting “yes/no” as far as whether someone will make a certain decision
  • Since straight lines do not fit well to binary outcome data, the goal is to fit an S-shaped curve (logit) to the data
    • Outcomes will always be between 0 and 1
    • glm(y ~ x, data=, family=“binomial”)
  • Can convert the outcomes of a logistic regression to a categorical by setting a threshhold (e.g., 50% for yes/no)

Model performance trade-offs:

  • Rare events frequently pose challenges for classification models - high accuracy can be gained by just betting against the rare outcome
  • Sometimes it is better to give up some overall accuracy in exchange for imporved performance on one of the other diagnostic measures
  • The ROC (receiver operator charcetristic) curve can help with the trade-offs; the 45-degree line is a model with no predictive power
    • AUC (area under the curve) is 0.5 for the 45-degree line, 1.0 for perfect predictive power, 0.0 for getting every prediction wrong
    • However, can get the same AUC with different shapes of the ROC curve; easy vs. hard cases, for example

Dummy variables, missing data, and interactions:

  • The predictors need to all be numeric; categorical data must be converted to dummy (1/0) variables and missing data must be managed
  • It is generally a good idea to convert any numerical variable (even a number masquerading for a category like 1=High) to a factor first
  • Missing data can be addressed using imputation (mean, median, kNN, etc.)
    • Alternately, the missing indicator (1=missing, 0=present) can be very important in predicting
  • Can also have interaction effects among the predictor variables - e.g., obesity and smoking
    • The full interaction is modeled as y ~ xz (x alone, z alone, xz together)

Automatic feature selection:

  • Humans typically need to specify regression predictors ahead of time
  • Can use the automatic feature selection, though there are caveats and cautions
  • The general approach is called backward-deletion, where a model is run with all predictors first
    • At each step, the least important predictor is removed, and the new model accuracy is calculated
  • The parallel approach is called forward-addition, where a model is run with just one predictor first
    • At each step, the next most important predictor is added, and the new model is calculated
  • The backward-deletion and forward-addition models can find different solutions than each other, and also violate some statistical assumptions
    • The model may be much better for prediction than for explanation
    • The model can be built in the absence of theory or common sense, and the coefficients may be significantly misleading

Example code includes:

# The donors dataset contains 93,462 examples of people mailed in a fundraising solicitation for paralyzed military veterans
# The donated column is 1 if the person made a donation in response to the mailing and 0 otherwise
# This binary outcome will be the dependent variable for the logistic regression model

# The remaining columns are features of the prospective donors that may influence their donation behavior
# These are the model's independent variables

# When building a regression model, it it often helpful to form a hypothesis about which independent variables will be predictive of the dependent variable
# The bad_address column, which is set to 1 for an invalid mailing address and 0 otherwise, seems like it might reduce the chances of a donation
# Similarly, one might suspect that religious interest (interest_religion) and interest in veterans affairs (interest_veterans) would be associated with greater charitable giving

# The dataset donors is available in your workspace.
donorsRaw <- read.csv("./RInputFiles/donors.csv")
str(donorsRaw)
## 'data.frame':    93462 obs. of  13 variables:
##  $ donated          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ veteran          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bad_address      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age              : int  60 46 NA 70 78 NA 38 NA NA 65 ...
##  $ has_children     : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ wealth_rating    : int  0 3 1 2 1 0 2 3 1 0 ...
##  $ interest_veterans: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ interest_religion: int  0 0 0 0 1 0 0 0 0 0 ...
##  $ pet_owner        : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ catalog_shopper  : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ recency          : Factor w/ 2 levels "CURRENT","LAPSED": 1 1 1 1 1 1 1 1 1 1 ...
##  $ frequency        : Factor w/ 2 levels "FREQUENT","INFREQUENT": 1 1 1 1 1 2 2 1 2 2 ...
##  $ money            : Factor w/ 2 levels "HIGH","MEDIUM": 2 1 2 2 2 2 2 2 2 2 ...
# Examine the dataset to identify potential independent variables
donors <- donorsRaw
str(donors)
## 'data.frame':    93462 obs. of  13 variables:
##  $ donated          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ veteran          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bad_address      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age              : int  60 46 NA 70 78 NA 38 NA NA 65 ...
##  $ has_children     : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ wealth_rating    : int  0 3 1 2 1 0 2 3 1 0 ...
##  $ interest_veterans: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ interest_religion: int  0 0 0 0 1 0 0 0 0 0 ...
##  $ pet_owner        : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ catalog_shopper  : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ recency          : Factor w/ 2 levels "CURRENT","LAPSED": 1 1 1 1 1 1 1 1 1 1 ...
##  $ frequency        : Factor w/ 2 levels "FREQUENT","INFREQUENT": 1 1 1 1 1 2 2 1 2 2 ...
##  $ money            : Factor w/ 2 levels "HIGH","MEDIUM": 2 1 2 2 2 2 2 2 2 2 ...
# Explore the dependent variable
table(donors$donated)
## 
##     0     1 
## 88751  4711
# Build the donation model
donation_model <- glm(donated ~ bad_address + interest_religion + interest_veterans, 
                      data = donors, family = "binomial")

# Summarize the model results
summary(donation_model)
## 
## Call:
## glm(formula = donated ~ bad_address + interest_religion + interest_veterans, 
##     family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3480  -0.3192  -0.3192  -0.3192   2.5678  
## 
## Coefficients:
##                   Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)       -2.95139    0.01652 -178.664   <2e-16 ***
## bad_address       -0.30780    0.14348   -2.145   0.0319 *  
## interest_religion  0.06724    0.05069    1.327   0.1847    
## interest_veterans  0.11009    0.04676    2.354   0.0186 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 37316  on 93458  degrees of freedom
## AIC: 37324
## 
## Number of Fisher Scoring iterations: 5
# By default, predict() outputs predictions in terms of log odds unless type = "response" is specified
# This converts the log odds to probabilities

# Because a logistic regression model estimates the probability of the outcome, it is up to you to determine the threshold at which the probability implies action
# One must balance the extremes of being too cautious versus being too aggressive
# For example, if you were to solicit only the people with a 99% or greater donation probability, you may miss out on many people with lower estimated probabilities that still choose to donate
# This balance is particularly important to consider for severely imbalanced outcomes, such as in this dataset where donations are relatively rare

# Estimate the donation probability
donors$donation_prob <- predict(donation_model, type = "response")

# Find the donation probability of the average prospect
mean(donors$donated)
## [1] 0.05040551
# Predict a donation if probability of donation is greater than average (0.0504)
donors$donation_pred <- ifelse(donors$donation_prob > 0.0504, 1, 0)

# Calculate the model's accuracy
mean(donors$donated == donors$donation_pred)
## [1] 0.794815
# The dataset donors with the column of predicted probabilities, donation_prob ,is already loaded in your workspace.
# Load the pROC package
# library(pROC)

# Create a ROC curve
ROC <- pROC::roc(donors$donated, donors$donation_prob)

# Plot the ROC curve
plot(ROC, col = "blue")

# Calculate the area under the curve (AUC)
pROC::auc(ROC)
## Area under the curve: 0.5102
# In the donors dataset, wealth_rating uses numbers to indicate the donor's wealth level:
# 0 = Unknown
# 1 = Low
# 2 = Medium
# 3 = High

# Convert the wealth rating to a factor
donors$wealth_rating <- factor(donors$wealth_rating, levels = c(0, 1, 2, 3), labels = c("Unknown", "Low", "Medium", "High"))

# Use relevel() to change reference category
donors$wealth_rating <- relevel(donors$wealth_rating, ref = "Medium")

# See how our factor coding impacts the model
summary(glm(donated ~ wealth_rating, data=donors, family="binomial"))
## 
## Call:
## glm(formula = donated ~ wealth_rating, family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3320  -0.3243  -0.3175  -0.3175   2.4582  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -2.91894    0.03614 -80.772   <2e-16 ***
## wealth_ratingUnknown -0.04373    0.04243  -1.031    0.303    
## wealth_ratingLow     -0.05245    0.05332  -0.984    0.325    
## wealth_ratingHigh     0.04804    0.04768   1.008    0.314    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 37323  on 93458  degrees of freedom
## AIC: 37331
## 
## Number of Fisher Scoring iterations: 5
# Find the average age among non-missing values
summary(donors$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   48.00   62.00   61.65   75.00   98.00   22546
# Impute missing age values with mean(age)
donors$imputed_age <- ifelse(is.na(donors$age), round(mean(donors$age, na.rm=TRUE), 2), donors$age)

# Create missing value indicator for age
donors$missing_age <- ifelse(is.na(donors$age), 1, 0)


# One of the best predictors of future giving is a history of recent, frequent, and large gifts. In marketing terms, this is known as R/F/M
# Recency
# Frequency
# Money

# Donors that haven given both recently and frequently may be especially likely to give again; in other words, the combined impact of recency and frequency may be greater than the sum of the separate effects

# Because these predictors together have a greater impact on the dependent variable, their joint effect must be modeled as an interaction

# Build a recency, frequency, and money (RFM) model
rfm_model <- glm(donated ~ money + recency*frequency, data=donors, family="binomial")

# Summarize the RFM model to see how the parameters were coded
summary(rfm_model)
## 
## Call:
## glm(formula = donated ~ money + recency * frequency, family = "binomial", 
##     data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3696  -0.3696  -0.2895  -0.2895   2.7924  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -3.01142    0.04279 -70.375   <2e-16 ***
## moneyMEDIUM                        0.36186    0.04300   8.415   <2e-16 ***
## recencyLAPSED                     -0.86677    0.41434  -2.092   0.0364 *  
## frequencyINFREQUENT               -0.50148    0.03107 -16.143   <2e-16 ***
## recencyLAPSED:frequencyINFREQUENT  1.01787    0.51713   1.968   0.0490 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 36938  on 93457  degrees of freedom
## AIC: 36948
## 
## Number of Fisher Scoring iterations: 6
# Compute predicted probabilities for the RFM model
rfm_prob <- predict(rfm_model, type="response")

# Plot the ROC curve and find AUC for the new model
ROC <- pROC::roc(donors$donated, rfm_prob)
plot(ROC, col = "red")

pROC::auc(ROC)
## Area under the curve: 0.5785
# In the absence of subject-matter expertise, stepwise regression can assist with the search for the most important predictors of the outcome of interest.
# In this exercise, you will use a forward stepwise approach to add predictors to the model one-by-one until no additional benefit is seen

donorsSmall <- donors[sample(seq_len(nrow(donors)), 20000, replace=FALSE), ] %>% 
    select(-donation_prob, -donation_pred, -age)
str(donorsSmall)
## 'data.frame':    20000 obs. of  14 variables:
##  $ donated          : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ veteran          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bad_address      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ has_children     : int  0 0 0 1 0 1 0 0 0 0 ...
##  $ wealth_rating    : Factor w/ 4 levels "Medium","Unknown",..: 4 4 3 2 2 4 1 2 2 4 ...
##  $ interest_veterans: int  0 0 0 0 0 0 0 1 0 0 ...
##  $ interest_religion: int  1 0 0 0 0 1 0 0 1 0 ...
##  $ pet_owner        : int  1 0 0 1 0 0 0 1 0 0 ...
##  $ catalog_shopper  : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ recency          : Factor w/ 2 levels "CURRENT","LAPSED": 1 1 1 1 1 1 1 1 1 1 ...
##  $ frequency        : Factor w/ 2 levels "FREQUENT","INFREQUENT": 2 1 2 1 1 1 2 2 1 2 ...
##  $ money            : Factor w/ 2 levels "HIGH","MEDIUM": 2 2 2 2 2 2 1 2 2 2 ...
##  $ imputed_age      : num  52 35 78 27 24 ...
##  $ missing_age      : num  0 0 0 0 0 0 1 0 1 0 ...
# Specify a null model with no predictors
null_model <- glm(donated ~ 1, data = donorsSmall, family = "binomial")

# Specify the full model using all of the potential predictors
full_model <- glm(donated ~ ., data=donorsSmall, family="binomial")

# Use a forward stepwise algorithm to build a parsimonious model
step_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "forward")
## Start:  AIC=7705.35
## donated ~ 1
## 
##                     Df Deviance    AIC
## + frequency          1   7642.9 7646.9
## + money              1   7694.7 7698.7
## + imputed_age        1   7696.6 7700.6
## + bad_address        1   7699.4 7703.4
## + missing_age        1   7699.5 7703.5
## + has_children       1   7700.8 7704.8
## <none>                   7703.3 7705.3
## + wealth_rating      3   7697.9 7705.9
## + pet_owner          1   7703.0 7707.0
## + catalog_shopper    1   7703.0 7707.0
## + recency            1   7703.1 7707.1
## + interest_religion  1   7703.3 7707.3
## + veteran            1   7703.3 7707.3
## + interest_veterans  1   7703.3 7707.3
## 
## Step:  AIC=7646.94
## donated ~ frequency
## 
##                     Df Deviance    AIC
## + bad_address        1   7638.5 7644.5
## + imputed_age        1   7638.5 7644.5
## + missing_age        1   7638.9 7644.9
## + money              1   7639.5 7645.5
## + has_children       1   7640.4 7646.4
## <none>                   7642.9 7646.9
## + wealth_rating      3   7637.3 7647.3
## + pet_owner          1   7642.6 7648.6
## + catalog_shopper    1   7642.7 7648.7
## + recency            1   7642.8 7648.8
## + interest_veterans  1   7642.9 7648.9
## + veteran            1   7642.9 7648.9
## + interest_religion  1   7642.9 7648.9
## 
## Step:  AIC=7644.47
## donated ~ frequency + bad_address
## 
##                     Df Deviance    AIC
## + imputed_age        1   7633.6 7641.6
## + missing_age        1   7634.5 7642.5
## + money              1   7635.2 7643.2
## + has_children       1   7635.8 7643.8
## <none>                   7638.5 7644.5
## + wealth_rating      3   7633.0 7645.0
## + pet_owner          1   7638.1 7646.1
## + catalog_shopper    1   7638.3 7646.3
## + recency            1   7638.3 7646.3
## + interest_veterans  1   7638.4 7646.4
## + veteran            1   7638.4 7646.4
## + interest_religion  1   7638.4 7646.4
## 
## Step:  AIC=7641.63
## donated ~ frequency + bad_address + imputed_age
## 
##                     Df Deviance    AIC
## + missing_age        1   7629.9 7639.9
## + money              1   7630.3 7640.3
## <none>                   7633.6 7641.6
## + wealth_rating      3   7627.7 7641.7
## + has_children       1   7632.5 7642.5
## + pet_owner          1   7632.9 7642.9
## + catalog_shopper    1   7633.2 7643.2
## + recency            1   7633.5 7643.5
## + interest_veterans  1   7633.5 7643.5
## + veteran            1   7633.6 7643.6
## + interest_religion  1   7633.6 7643.6
## 
## Step:  AIC=7639.92
## donated ~ frequency + bad_address + imputed_age + missing_age
## 
##                     Df Deviance    AIC
## + money              1   7626.6 7638.6
## <none>                   7629.9 7639.9
## + has_children       1   7628.1 7640.1
## + wealth_rating      3   7624.9 7640.9
## + pet_owner          1   7629.5 7641.5
## + interest_veterans  1   7629.6 7641.6
## + catalog_shopper    1   7629.7 7641.7
## + recency            1   7629.7 7641.7
## + veteran            1   7629.9 7641.9
## + interest_religion  1   7629.9 7641.9
## 
## Step:  AIC=7638.6
## donated ~ frequency + bad_address + imputed_age + missing_age + 
##     money
## 
##                     Df Deviance    AIC
## <none>                   7626.6 7638.6
## + has_children       1   7624.6 7638.6
## + wealth_rating      3   7621.4 7639.4
## + pet_owner          1   7626.2 7640.2
## + interest_veterans  1   7626.3 7640.3
## + catalog_shopper    1   7626.4 7640.4
## + recency            1   7626.4 7640.4
## + veteran            1   7626.5 7640.5
## + interest_religion  1   7626.6 7640.6
# Estimate the stepwise donation probability
step_prob <- predict(step_model, type="response")

# Plot the ROC of the stepwise model
ROC <- pROC::roc(donorsSmall$donated, step_prob)
plot(ROC, col = "red")

pROC::auc(ROC)
## Area under the curve: 0.5822

Chapter 4 - Classification Trees

Making decisions with trees:

  • Classification trees are helpful for if/else decision making - multiple branches leading to an outcome of interest
  • Root nodes are the beginning, and leaf nodes are the final outcomes
    • Intermediate nodes may be in between the root nodes and the leaf nodes
    • Tree-based modeling is said to follow a “divide and conquer” methodology - starts with the most important charcateristic
    • Each split is designed to maximize homogeneity of the remaining groups
  • One of the most widely-used tools for trees in R is called rpart
    • Generally, want to use method=“class” since the goal is categorical prediction (rather than continuous prediction)

Growing larger classification trees:

  • The first split is the split that produces the purest partitions
  • As the tree grows, it creates better and better splits
    • The tree generally does not consider combinations of features all at once, meaning it misses things like a perfect diagonal line split
    • For some patterns, the decision tree can be extremely complex (needlessly so based on the actual structure of the data)
    • Further, there is a tendency for a large tree to over-fit the data (variance errors, fitting on noise)
  • Due to the tendency for trees to over-fit, holding out a test set is important for assessing actual tree prediction performance

Tending to classification trees:

  • Classification trees require pruning to keep them to a good performance on the test dataset
    • Can also pre-prune, the process of stopping growing trees once it hits a key parameter (maximum depth, maximum nodes, minimum node size, etc.)
    • There are trade-offs in that a tree that stops growing too soon can miss out on real signal in the data; post-pruning a large tree can help identify this
  • The elbow in a plot of tree complexity vs. error rate is frequently a good point to post-prune the tree
  • The rpart.control() function can handle the pre-pruning of the tree
    • maxdepth - maximum depth
    • minsplit - minimum points per split
  • Processes are also available to handle post-pruning of the tree
    • The plotcp() call will help to identify the relationship between cp and accuracy
    • The best cp parameter can be fed in to the prune() call to prune the tree at this point

Seeing the forest from the trees:

  • Root nodes growing to branches (intermediate nodes) to leaf nodes that need pruning
  • Forests made from aggregations of trees (decision trees, random forests) can be very powerful for classification
    • Each of the underlying trees is typically both simple and diverse
    • Diversity of trees requires different conditions - resampled data with replacement, randomized subsets of data
    • Ensemble methods often incorporate the team’s collective strengths, even if the individual members are not too strong
  • Can use randomForest::randomForest(fmla, data=, ntree=, mtry=) # defaults are for ntree=500 and mrty=sqrt(p) where p is the number of parameters

Example code includes:

# The loans dataset contains 11,312 randomly-selected people who were applied for and later received loans from Lending Club, a US-based peer-to-peer lending company
# You will use a decision tree to try to learn patterns in the outcome of these loans (either repaid or default) based on the requested loan amount and credit score at the time of application
# Then, see how the tree's predictions differ for an applicant with good credit versus one with bad credit

# The dataset loans is already in your workspace
loansRaw <- read.csv("./RInputFiles/loans.csv")
str(loansRaw)
## 'data.frame':    39732 obs. of  16 variables:
##  $ keep              : int  1 1 0 0 0 0 0 1 1 1 ...
##  $ rand              : num  0.13 0.998 0.628 0.252 0.474 ...
##  $ default           : int  0 1 0 0 0 0 0 0 1 1 ...
##  $ loan_amount       : Factor w/ 3 levels "HIGH","LOW","MEDIUM": 2 2 2 3 2 2 3 2 3 2 ...
##  $ emp_length        : Factor w/ 5 levels "< 2 years","10+ years",..: 2 1 2 2 1 3 4 4 3 1 ...
##  $ home_ownership    : Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 4 4 4 3 4 ...
##  $ income            : Factor w/ 3 levels "HIGH","LOW","MEDIUM": 2 2 2 3 1 2 3 3 3 2 ...
##  $ loan_purpose      : Factor w/ 14 levels "car","credit_card",..: 2 1 12 10 10 14 3 1 12 10 ...
##  $ debt_to_income    : Factor w/ 3 levels "AVERAGE","HIGH",..: 2 3 1 2 1 1 2 3 3 1 ...
##  $ credit_score      : Factor w/ 3 levels "AVERAGE","HIGH",..: 1 1 1 1 1 1 1 3 1 1 ...
##  $ recent_inquiry    : Factor w/ 2 levels "NO","YES": 2 2 2 2 1 2 2 2 2 1 ...
##  $ delinquent        : Factor w/ 3 levels "IN PAST 2 YEARS",..: 3 3 3 2 2 3 3 3 3 3 ...
##  $ credit_accounts   : Factor w/ 3 levels "AVERAGE","FEW",..: 2 2 2 1 3 1 1 2 1 2 ...
##  $ bad_public_record : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
##  $ credit_utilization: Factor w/ 3 levels "HIGH","LOW","MEDIUM": 1 2 1 2 3 3 1 1 3 3 ...
##  $ past_bankrupt     : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
loans <- loansRaw %>% 
    filter(keep == 1) %>% 
    mutate(outcome=factor(default, levels=c(1, 0), labels=c("default", "repaid"))) %>%
    select(-keep, -rand, -default)
str(loans)
## 'data.frame':    11312 obs. of  14 variables:
##  $ loan_amount       : Factor w/ 3 levels "HIGH","LOW","MEDIUM": 2 2 2 3 2 3 3 2 1 3 ...
##  $ emp_length        : Factor w/ 5 levels "< 2 years","10+ years",..: 2 1 4 3 1 1 3 2 2 1 ...
##  $ home_ownership    : Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 3 4 4 4 1 4 4 ...
##  $ income            : Factor w/ 3 levels "HIGH","LOW","MEDIUM": 2 2 3 3 2 2 1 1 1 3 ...
##  $ loan_purpose      : Factor w/ 14 levels "car","credit_card",..: 2 1 1 12 10 3 10 7 3 7 ...
##  $ debt_to_income    : Factor w/ 3 levels "AVERAGE","HIGH",..: 2 3 3 3 1 1 3 1 1 3 ...
##  $ credit_score      : Factor w/ 3 levels "AVERAGE","HIGH",..: 1 1 3 1 1 1 1 2 1 1 ...
##  $ recent_inquiry    : Factor w/ 2 levels "NO","YES": 2 2 2 2 1 2 2 1 1 2 ...
##  $ delinquent        : Factor w/ 3 levels "IN PAST 2 YEARS",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ credit_accounts   : Factor w/ 3 levels "AVERAGE","FEW",..: 2 2 2 1 2 2 3 3 1 1 ...
##  $ bad_public_record : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
##  $ credit_utilization: Factor w/ 3 levels "HIGH","LOW","MEDIUM": 1 2 1 3 3 1 3 2 1 3 ...
##  $ past_bankrupt     : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
##  $ outcome           : Factor w/ 2 levels "default","repaid": 2 1 2 1 1 1 1 2 1 1 ...
# Build a lending model predicting loan outcome versus loan amount and credit score
loan_model <- rpart::rpart(outcome ~ loan_amount + credit_score, data = loans, method = "class", control = rpart::rpart.control(cp = 0))

good_credit_raw <- c(2, 1, 1, 1, 7, 1, 2, 1, 3, 3, 1, 2, 1, 2)
bad_credit_raw <- c(2, 3, 4, 3, 1, 3, 3, 2, 3, 2, 1, 1, 1, 2)

loansTest <- loans %>% filter(FALSE)
for (intCtr in seq_len(ncol(loansTest))) {
    loansTest[1, intCtr] <- levels(loansTest[, intCtr])[good_credit_raw[intCtr]]
    loansTest[2, intCtr] <- levels(loansTest[, intCtr])[bad_credit_raw[intCtr]]
}

good_credit <- loansTest[1, ,drop=FALSE]
bad_credit <- loansTest[2, ,drop=FALSE]

# Make a prediction for someone with good credit
predict(loan_model, good_credit, type = "class")
##      1 
## repaid 
## Levels: default repaid
# Make a prediction for someone with bad credit
predict(loan_model, bad_credit, type = "class")
##       2 
## default 
## Levels: default repaid
# Due to government rules to prevent illegal discrimination, lenders are required to explain why a loan application was rejected
# The structure of classification trees can be depicted visually, which helps to understand how the tree makes its decisions

# Examine the loan_model object
loan_model
## n= 11312 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 11312 5654 repaid (0.4998232 0.5001768)  
##    2) credit_score=AVERAGE,LOW 9490 4437 default (0.5324552 0.4675448)  
##      4) credit_score=LOW 1667  631 default (0.6214757 0.3785243) *
##      5) credit_score=AVERAGE 7823 3806 default (0.5134859 0.4865141)  
##       10) loan_amount=HIGH 2472 1079 default (0.5635113 0.4364887) *
##       11) loan_amount=LOW,MEDIUM 5351 2624 repaid (0.4903756 0.5096244)  
##         22) loan_amount=LOW 1810  874 default (0.5171271 0.4828729) *
##         23) loan_amount=MEDIUM 3541 1688 repaid (0.4767015 0.5232985) *
##    3) credit_score=HIGH 1822  601 repaid (0.3298573 0.6701427) *
# Plot the loan_model with default settings
rpart.plot::rpart.plot(loan_model)

# Plot the loan_model with customized settings
rpart.plot::rpart.plot(loan_model, type = 3, box.palette = c("red", "green"), fallen.leaves = TRUE)

# Determine the number of rows for training
nrow(loans) * 0.75
## [1] 8484
# Create a random sample of row IDs
sample_rows <- sample(1:nrow(loans), round(nrow(loans) * 0.75))

# Create the training dataset
loans_train <- loans[sample_rows, ]

# Create the test dataset
loans_test <- loans[-sample_rows, ]


# The 'rpart' package is loaded into the workspace
# The loans_train and loans_test datasets have been created

# Grow a tree using all of the available applicant data
loan_model <- rpart::rpart(outcome ~ ., data = loans_train, method = "class", control = rpart::rpart.control(cp = 0))

# Make predictions on the test dataset
loans_test$pred <- predict(loan_model, newdata=loans_test, type="class")

# Examine the confusion matrix
table(loans_test$pred, loans_test$outcome)
##          
##           default repaid
##   default     783    628
##   repaid      616    801
# Compute the accuracy on the test dataset
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5601132
# The 'rpart' package is loaded into the workspace

# Grow a tree with maxdepth of 6
loan_model <- rpart::rpart(outcome ~ ., data = loans_train, method = "class", control = rpart::rpart.control(cp = 0, maxdepth=6))

# Compute the accuracy of the simpler tree
loans_test$pred <- predict(loan_model, newdata=loans_test, type="class")
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5990099
# Grow a tree with minsplit of 500
loan_model2 <- rpart::rpart(outcome ~ ., data = loans_train, method = "class", control = rpart::rpart.control(cp = 0, minsplit=500))

# Compute the accuracy of the simpler tree
loans_test$pred2 <- predict(loan_model2, newdata=loans_test, type="class")
mean(loans_test$pred2 == loans_test$outcome)
## [1] 0.5915842
# Stopping a tree from growing all the way can lead it to ignore some aspects of the data or miss important trends it may have discovered later
# By using post-pruning, you can intentionally grow a large and complex then prune it to be smaller and more efficient later on
# In this exercise, you will have the opportunity to construct a visualization of the tree's performance versus complexity, and use this information to prune the tree to an appropriate level

# The 'rpart' package is loaded into the workspace

# Grow an overly complex tree
loan_model <- rpart::rpart(outcome ~ ., data = loans_train, method = "class", control = rpart::rpart.control(cp = 0))

# Examine the complexity plot
rpart::plotcp(loan_model)

# Prune the tree
loan_model_pruned <- rpart::prune(loan_model, cp = 0.0014)

# Compute the accuracy of the pruned tree
loans_test$pred <- predict(loan_model_pruned, newdata=loans_test, type="class")
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5983027
# Build a random forest model
loan_model <- randomForest::randomForest(outcome ~ ., data = loans_train)

# Compute the accuracy of the random forest
loans_test$pred <- predict(loan_model, newdata=loans_test)
mean(loans_test$pred == loans_test$outcome)
## [1] 0.582744

Introduction to R Using sparklyr

Chapter 1 - Starting Spark with dplyr

Getting started - using Spark with dplyr:

  • By default, R is limited to storage in RAM on a single dataset
  • Spark is a cluster computing capability, effectively removing the size limits for the dataset
    • sparklyr is an R package that allows you to access Spark from R
    • As an added bonus, sparklyr uses the dplyr syntax
  • There is sometimes so under-development of the packages and the interactions; some hands-on problem solving may be needed
  • The available dplyr verbs are as might be expected - select, filter, arrange, mutate, summarize

Example code (not evaluated) includes:

# Before you get too excited, a word of warning
# Spark is still a very new technology, and some niceties like clear error messages aren't there yet
# So when things go wrong, it can be hard to understand why.

# sparklyr is newer, and doesn't have a full set of features
# There are some things that you just can't do with Spark from R right now. The Scala and Python interfaces to Spark are more mature

# That means that you are sailing into uncharted territory with this course
# The trip may be a little rough, so be prepared to be out of your comfort zone occasionally

# One further note of caution is that in this course you'll be running code on your own personal Spark mini-cluster in the DataCamp cloud
# This is ideal for learning the concepts of how to use Spark, but you won't get the same performance boost as you would using a remote cluster on a high-performance server
# That means that the examples here won't run faster than if you were only using R, but you can use the skills you learn here to run analyses on your own big datasets

# If you wish to install Spark on your local system, simply install the sparklyr package and call spark_install()

# Working with sparklyr is very much like working with dplyr when you have data inside a database
# In fact, sparklyr converts your R code into SQL code before passing it to Spark

# The typical workflow has three steps
# Connect to Spark using spark_connect().
# Do some work.
# Close the connection to Spark using spark_disconnect()

# In this exercise, you'll do this simplest possible piece of work: returning the version of Spark that is running, using spark_version()

# spark_connect() takes a URL that gives the location to Spark
# For a local cluster (as you are running), the URL should be "local"
# For a remote cluster (on another machine, typically a high-performance server), the connection string will be a URL and port to connect on

# spark_version() and spark_disconnect() both take the Spark connection as their only argument

# One word of warning. Connecting to a cluster takes several seconds, so it is impractical to regularly connect and disconnect
# While you need to reconnect for each DataCamp exercise, when you incorporate sparklyr into your own workflow, it is usually best to keep the connection open for the whole time that you want to work with Spark

# Load sparklyr
library(sparklyr)

# Connect to your Spark cluster
spark_conn <- spark_connect(master="local")

# Print the version of Spark
spark_version(sc=spark_conn)

# Disconnect from Spark
spark_disconnect(sc=spark_conn)


# Before you can do any real work using Spark, you need to get your data into it. sparklyr has some functions such as spark_read_csv() that will read a CSV file into Spark
# More generally, it is useful to be able to copy data from R to Spark
# This is done with dplyr's copy_to() function
# Be warned: copying data is a fundamentally slow process
# In fact, a lot of strategy regarding optimizing performance when working with big datasets is to find ways of avoiding copying the data from one location to another

# copy_to() takes two arguments: a Spark connection (dest), and a data frame (df) to copy over to Spark

# Once you have copied your data into Spark, you might want some reassurance that it has actually worked
# You can see a list of all the data frames stored in Spark using src_tbls(), which simply takes a Spark connection argument (x)

# Throughout the course, you will explore track metadata from the Million Song Dataset
# While Spark will happily scale well past a million rows of data, to keep things simple and responsive, you will use a thousand track subset
# To clarify the terminology: a track refers to a row in the dataset
# For your thousand track dataset, this is the same thing as a song (though the full million row dataset suffered from some duplicate songs)

# Load dplyr
library(dplyr)

# Explore track_metadata structure
str(track_metadata)

# Connect to your Spark cluster
spark_conn <- spark_connect(master="local")

# Copy track_metadata to Spark
track_metadata_tbl <- copy_to(spark_conn, track_metadata)

# List the data frames available in Spark
src_tbls(spark_conn)

# Disconnect from Spark
spark_disconnect(sc=spark_conn)


# In the last exercise, when you copied the data to Spark, copy_to() returned a value
# This return value is a special kind of tibble() that doesn't contain any data of its own
# To explain this, you need to know a bit about the way that tidyverse packages store data
# Tibbles are usually just a variant of data.frames that have a nicer print method
# However, dplyr also allows them to store data from a remote data source, such as databases, and – as is the case here – Spark
# For remote datasets, the tibble object simply stores a connection to the remote data
# This will be discussed in more detail later, but the important point for now is that even though you have a big dataset, the size of the tibble object is small

# On the Spark side, the data is stored in a variable called a DataFrame
# This is a more or less direct equivalent of R's data.frame variable type. (Though the column variable types are named slightly differently – for example numeric columns are called DoubleType columns.)
# Throughout the course, the term data frame will be used, unless clarification is needed between data.frame and DataFrame
# Since these types are also analogous to database tables, sometimes the term table will also be used to describe this sort of rectangular data

# Calling tbl() with a Spark connection, and a string naming the Spark data frame will return the same tibble object that was returned when you used copy_to()

# A useful tool that you will see in this exercise is the object_size() function from the pryr package
# This shows you how much memory an object takes up

# A Spark connection has been created for you as spark_conn
# The track metadata for 1,000 tracks is stored in the Spark cluster in the table "track_metadata"

# Link to the track_metadata table in Spark
track_metadata_tbl <- tbl(spark_conn, "track_metadata")

# See how big the dataset is
dim(track_metadata_tbl)

# See how small the tibble is
object_size(track_metadata_tbl)


# If you try to print a tibble that describes data stored in Spark, some magic has to happen, since the tibble doesn't keep a copy of the data itself
# The magic is that the print method uses your Spark connection, copies some of the contents back to R, and displays those values as though the data had been stored locally
# As you saw earlier in the chapter, copying data is a slow operation, so by default, only 10 rows and as many columns will fit onscreen, are printed

# You can change the number of rows that are printed using the n argument to print()
# You can also change the width of content to display using the width argument, which is specified as the number of characters (not the number of columns)
# A nice trick is to use width = Inf to print all the columns

# The str() function is typically used to display the structure of a variable
# For data.frames, it gives a nice summary with the type and first few values of each column
# For tibbles that have a remote data source however, str() doesn't know how to retrieve the data
# That means that if you call str() on a tibble that contains data stored in Spark, you see a list containing a Spark connection object, and a few other bits and pieces

# If you want to see a summary of what each column contains in the dataset that the tibble refers to, you need to call glimpse() instead
# Note that for remote data such as those stored in a Spark cluster datasets, the number of rows is a lie!
# In this case, glimpse() never claims that the data has more than 25 rows

# Print 5 rows, all columns
print(track_metadata_tbl, n=5, width=Inf)

# Examine structure of tibble
str(track_metadata_tbl)

# Examine structure of data
glimpse(track_metadata_tbl)


# The easiest way to manipulate data frames stored in Spark is to use dplyr syntax
# Manipulating data frames using the dplyr syntax is covered in detail in the Data Manipulation in R with dplyr and Joining Data in R with dplyr courses, but you'll spend the next chapter and a half covering all the important points

# dplyr has five main actions that you can perform on a data frame
# You can select columns, filter rows, arrange the order of rows, change columns or add new columns, and calculate summary statistics.
# Note that square bracket indexing is not currently supported in sparklyr
# So you cannot do a_tibble[, c("x", "y", "z")]
# a_tibble %>% select(x, y, z)

# track_metadata_tbl has been pre-defined
track_metadata_tbl

# Manipulate the track metadata
track_metadata_tbl %>%
  # Select columns
  select(artist_name, release, title, year)

# Try to select columns using [ ]
tryCatch({
    # Selection code here
    track_metadata_tbl[, c("artist_name", "release", "title", "year")]
  },
  error = print
)


# tryCatch(error = print) is a nice way to see errors without them stopping the execution of your code.

# Before you try the exercise, take heed of two warnings
# Firstly, don't mistake dplyr's filter() function with the stats package's filter() function
# Secondly, sparklyr converts your dplyr code into SQL database code before passing it to Spark
# That means that only a limited number of filtering operations are currently supported
# For example, you can't filter character rows using regular expressions with code like a_tibble %>% filter(grepl("a regex", x))

# The help page for translate_sql() describes the functionality that is available
# You are OK to use comparison operators like >, !=, and %in%; arithmetic operators like +, ^, and %%; and logical operators like &, | and !
# Many mathematical functions such as log(), abs(), round(), and sin() are also supported

# track_metadata_tbl has been pre-defined
glimpse(track_metadata_tbl)

# Manipulate the track metadata
track_metadata_tbl %>%
  # Select columns
  select(artist_name, release, title, year) %>%
  # Filter rows
  filter(year >= 1960, year < 1970)


# Notice the use of desc() to enforce sorting by descending order
# Also be aware that in sparklyr, the order() function, used for arranging the rows of data.frames does not work.

# track_metadata_tbl has been pre-defined
track_metadata_tbl

# Manipulate the track metadata
track_metadata_tbl %>%
  # Select columns
  select(artist_name, release, title, year) %>%
  # Filter rows
  filter(year >= 1960, year < 1970) %>%
  # Arrange rows
  arrange(artist_name, desc(year), title)


# track_metadata_tbl has been pre-defined
track_metadata_tbl

# Manipulate the track metadata
track_metadata_tbl %>%
  # Select columns
  select(title, duration) %>%
  # Mutate columns
  mutate(duration_minutes = duration/60)


# Note that dplyr has a philosophy (passed on to sparklyr) of always keeping the data in tibbles
# So the return value here is a tibble with one row, and one column for each summary statistic that was calculated

# track_metadata_tbl has been pre-defined
track_metadata_tbl

# Manipulate the track metadata
track_metadata_tbl %>%
  # Select columns
  select(title, duration) %>%
  # Mutate columns
  mutate(duration_minutes=duration/60) %>%
  # Summarize columns
  summarize(mean_duration_minutes = mean(duration_minutes))

Chapter 2 - Advanced dplyr

Levelling Up:

  • Can use some of the helpers like starts_with() inside the filter and select and the like
  • SQL and database joins can be nice add-ons to the process

Example code includes:

# If your dataset has thousands of columns, and you want to select a lot of them, then typing the name of each column when you call select() can be very tedious
# Fortunately, select() has some helper functions to make it easy to select multiple columns without typing much code

# These helpers include starts_with() and ends_with(), that match columns that start or end with a certain prefix or suffix respectively
# Due to dplyr's special code evaluation techniques, these functions can only be called from inside a call to select(); they don't make sense on their own

# track_metadata_tbl has been pre-defined
track_metadata_tbl

track_metadata_tbl %>%
  # Select columns starting with artist
  select(starts_with("artist"))

track_metadata_tbl %>%
  # Select columns ending with id
  select(ends_with("id"))


# A more general way of matching columns is to check if their names contain a value anywhere within them (rather than starting or ending with a value)
# As you may be able to guess, you can do this using a helper named contains()

# Even more generally, you can match columns using regular expressions
# Regular expressions ("regexes" for short) are a powerful language used for matching text
# If you want to learn how to use regular expressions, take the String Manipulation in R with stringr course
# For now, you only need to know three things

# a: A letter means "match that letter".
# .: A dot means "match any character, including letters, numbers, punctuation, etc.".
# ?: A question mark means "the previous character is optional".

# You can find columns that match a particular regex using the matches() select helper

# track_metadata_tbl has been pre-defined
track_metadata_tbl

track_metadata_tbl %>%
  # Select columns containing ti
  select(contains("ti"))

track_metadata_tbl %>%
  # Select columns matching ti.?t
  select(matches("ti.?t"))


# If you have a categorical variable stored in a factor, it is often useful to know what the individual categories are; you do this with the levels() function
# For a tibble, the more general concept is to find rows with unique data. Following the terminology from SQL, this is done using the distinct() function
# You can use it directly on your dataset, so you find unique combinations of a particular set of columns
# For example, to find the unique combinations of values in the x, y, and z columns, you would write the following: a_tibble %>% distinct(x, y, z)

# track_metadata_tbl has been pre-defined
track_metadata_tbl

track_metadata_tbl %>%
  # Only return rows with distinct artist_name
  distinct(artist_name)


# The distinct() function showed you the unique values
# It can also be useful to know how many of each value you have
# The base-R function for this is table(); that isn't supported in sparklyr since it doesn't conform to the tidyverse philosophy of keeping everything in tibbles
# Instead, you must use count(). To use it, pass the unquoted names of the columns
# For example, to find the counts of distinct combinations of columns x, y, and z, you would type the following: a_tibble %>% count(x, y, z)
# The result is the same as a_tibble %>% distinct(x, y, z) except that you get an extra column, n, that contains the counts

# A really nice use of count() is to get the most common values of something
# To do this, you call count(), with the argument sort = TRUE which sorts the rows by descending values of the n column, then use top_n() to restrict the results to the top however-many values
# (top_n() is similar to base-R's head(), but it works with remote datasets such as those in Spark.)
# For example, to get the top 20 most common combinations of the x, y, and z columns, use the following a_tibble %>% count(x, y, z, sort = TRUE) %>% top_n(20)

# track_metadata_tbl has been pre-defined
track_metadata_tbl

track_metadata_tbl %>%
  # Count the artist_name values
  count(artist_name, sort=TRUE) %>%
  # Restrict to top 20
  top_n(20)


# To collect your data: that is, to move it from Spark to R, you call collect()

# track_metadata_tbl has been pre-defined
track_metadata_tbl

results <- track_metadata_tbl %>%
  # Filter where artist familiarity is greater than 0.9
  filter(artist_familiarity > 0.9)

# Examine the class of the results
class(results)

# Collect your results
collected <- results %>%
  collect()

# Examine the class of the collected results
class(collected)


# You often want to store the results of intermediate calculations, but you don't want to collect them because it is slow
# The solution is to use compute() to compute the calculation, but store the results in a temporary data frame on Spark
# Compute takes two arguments: a tibble, and a variable name for the Spark data frame that will store the results

# track_metadata_tbl has been pre-defined
track_metadata_tbl

computed <- track_metadata_tbl %>%
  # Filter where artist familiarity is greater than 0.8
  filter(artist_familiarity > 0.9) %>%
  # Compute the results
  compute("familiar_artists")

# See the available datasets
src_tbls(spark_conn)

# Examine the class of the computed results
class(computed)


# track_metadata_tbl has been pre-defined
track_metadata_tbl

duration_by_artist <- track_metadata_tbl %>%
  # Group by artist
  group_by(artist_name) %>%
  # Calc mean duration
  summarize(mean_duration = mean(duration))

duration_by_artist %>%
  # Sort by ascending mean duration
  arrange(mean_duration)

duration_by_artist %>%
  # Sort by descending mean duration
  arrange(desc(mean_duration))


# track_metadata_tbl has been pre-defined
track_metadata_tbl

track_metadata_tbl %>%
  # Group by artist
  group_by(artist_name) %>%
  # Calc time since first release
  mutate(time_since_first_release = year-min(year)) %>%
  # Arrange by descending time since first release
  arrange(desc(time_since_first_release))


# As previously mentioned, when you use the dplyr interface, sparklyr converts your code into SQL before passing it to Spark
# Most of the time, this is what you want
# However, you can also write raw SQL to accomplish the same task
# Most of the time, this is a silly idea since the code is harder to write and harder to debug
# However, if you want your code to be portable – that is, used outside of R as well – then it may be useful
# For example, a fairly common workflow is to use sparklyr to experiment with data processing, then switch to raw SQL in a production environment
# By writing raw SQL to begin with, you can just copy and paste your queries when you move to production

# SQL queries are written as strings, and passed to dbGetQuery() from the DBI package. The pattern is as follows.
# query <- "SELECT col1, col2 FROM some_data WHERE some_condition"
# a_data.frame <- dbGetQuery(spark_conn, query)

# Note that unlike the dplyr code you've written, dbGetQuery() will always execute the query and return the results to R immediately
# If you want to delay returning the data, you can use dbSendQuery() to execute the query, then dbFetch() to return the results
# That's more advanced usage, not covered here
# Also note that DBI functions return data.frames rather than tibbles, since DBI is a lower-level package

# Write SQL query
query <- "SELECT * FROM track_metadata WHERE year < 1935 AND duration > 300"

# Run the query
(results <- dbGetQuery(spark_conn, query))


# A Spark connection has been created for you as spark_conn
# Tibbles attached to the track metadata and artist terms stored in Spark have been pre-defined as track_metadata_tbl and artist_terms_tbl respectively

# track_metadata_tbl and artist_terms_tbl have been pre-defined
track_metadata_tbl
artist_terms_tbl

# Left join artist terms to track metadata by artist_id
joined <- left_join(track_metadata_tbl, artist_terms_tbl, by = c("artist_id"))

# How many rows and columns are in the joined table?
dim(joined)


# Anti joins are really useful for finding problems with other joins.
# An anti join returns the rows of the first table where it cannot find a match in the second table. 

# track_metadata_tbl and artist_terms_tbl have been pre-defined
track_metadata_tbl
artist_terms_tbl

# Anti join artist terms to track metadata by artist_id
joined <- anti_join(track_metadata_tbl, artist_terms_tbl, by=c("artist_id"))

# How many rows and columns are in the joined table?
dim(joined)


# Semi joins are the opposite of anti joins: an anti-anti join, if you like
# A semi join returns the rows of the first table where it can find a match in the second table

# You may have spotted that the results of a semi join plus the results of an anti join give the orignial table
# So, regardless of the table contents or how you join them, semi_join(A, B) plus anti_join(A, B) will return A (though maybe with the rows in a different order)

# track_metadata_tbl and artist_terms_tbl have been pre-defined
track_metadata_tbl
artist_terms_tbl

# Semi join artist terms to track metadata by artist_id
joined <- semi_join(track_metadata_tbl, artist_terms_tbl, by=c("artist_id"))

# How many rows and columns are in the joined table?
dim(joined)

Chapter 3 - Native Interfaces to Manipulate Spark DataFrames

Two new interfaces:

  • The MLlib machine learning interface, with feature transformation functions named ft_, and machine learning transformations named ml_
    • Cutting a numerical field to a categorical variables
  • The Spark DataFrame API for sorting, sampling, and partitioning a dataset

Example code includes:

# The dplyr methods that you saw in the previous two chapters use Spark's SQL interface
# That is, they convert your R code into SQL code before passing it to Spark
# This is an excellent solution for basic data manipulation, but it runs into problems when you want to do more complicated processing
# For example, you can calculate the mean of a column, but not the median

# sparklyr also has two "native" interfaces that will be discussed in the next two chapters
# Native means that they call Java or Scala code to access Spark libraries directly, without any conversion to SQL
# sparklyr supports the Spark DataFrame Application Programming Interface (API), with functions that have an sdf_ prefix
# It also supports access to Spark's machine learning library, MLlib, with "feature transformation" functions that begin ft_, and "machine learning" functions that begin ml_.

# One important philosophical difference between working with R and working with Spark is that Spark is much stricter about variable types than R
# Most of the native functions want DoubleType inputs and return DoubleType outputs
# DoubleType is Spark's equivalent of R's numeric vector type. sparklyr will handle converting numeric to DoubleType, but it is up to the user (that's you!) to convert logical or integer data into numeric data and back again

# Logical variables are nice because it is often easier to think about things in "yes or no" terms rather than in numeric terms
# For example, if someone asks you "Would you like a cup of tea?", a yes or no response is preferable to "There is a 0.73 chance of me wanting a cup of tea"
# This has real data science applications too
# For example, a test for diabetes may return the glucose concentration in a patient's blood plasma as a number
# What you really care about is "Does the patient have diabetes?", so you need to convert the number into a logical value, based upon some threshold

# All the sparklyr feature transformation functions have a similar user interface
# The first three arguments are always a Spark tibble, a string naming the input column, and a string naming the output column
# That is, they follow this pattern: a_tibble %>% ft_some_transformation("x", "y", some_other_args)

# The sparklyr way of converting a continuous variable into logical uses ft_binarizer()
# The previous diabetes example can be rewritten as the following
# Note that the threshold value should be a number, not a string refering to a column in the dataset

# diabetes_data %>% ft_binarizer("plasma_glucose_concentration", "has_diabetes", threshold = threshold_mmol_per_l)

# In keeping with the Spark philosophy of using DoubleType everywhere, the output from ft_binarizer() isn't actually logical; it is numeric
# This is the correct approach for letting you continue to work in Spark and perform other transformations, but if you want to process your data in R, you have to remember to explicitly convert the data to logical
# The following is a common code pattern: a_tibble %>% ft_binarizer("x", "is_x_big", threshold = threshold) %>% collect() %>% mutate(is_x_big = as.logical(is_x_big))

# This exercise considers the appallingly named artist_hotttnesss field, which provides a measure of how much media buzz the artist had at the time the dataset was created
# If you would like to learn more about drawing plots using the ggplot2 package, please take the Data Visualization with ggplot2 (Part 1) course.

# track_metadata_tbl has been pre-defined
track_metadata_tbl

hotttnesss <- track_metadata_tbl %>%
  # Select artist_hotttnesss
  select(artist_hotttnesss) %>%
  # Binarize to is_hottt_or_nottt
  ft_binarizer("artist_hotttnesss", "is_hottt_or_nottt", threshold = 0.5) %>%
  # Collect the result
  collect() %>%
  # Convert is_hottt_or_nottt to logical
  mutate(is_hottt_or_nottt = as.logical(is_hottt_or_nottt))

# Draw a barplot of is_hottt_or_nottt
ggplot(hotttnesss, aes(is_hottt_or_nottt)) +
  geom_bar()


# A generalization of the previous idea is to have multiple thresholds; that is, you split a continuous variable into "buckets" (or "bins"), just like a histogram does
# In base-R, you would use cut() for this task
# For example, in a study on smoking habits, you could take the typical number of cigarettes smoked per day, and transform it into a factor

# smoking_status <- cut(
#   cigarettes_per_day,
#   breaks = c(0, 1, 10, 20, Inf),
#   labels = c("non", "light", "moderate", "heavy"),
#   right  = FALSE
# )

# The sparklyr equivalent of this is to use ft_bucketizer()
# The code takes a similar format to ft_binarizer(), but this time you must pass a vector of cut points to the splits argument
# Here is the same example rewritten in sparklyr style: smoking_data %>% ft_bucketizer("cigarettes_per_day", "smoking_status", splits = c(0, 1, 10, 20, Inf))

# There are several important things to note
# You may have spotted that the breaks argument from cut() is the same as the splits argument from ft_bucketizer()
# There is a slight difference in how values on the boundary are handled
# In cut(), by default, the upper (right-hand) boundary is included in each bucket, but not the left
# ft_bucketizer() includes the lower (left-hand) boundary in each bucket, but not the right
# This means that it is equivalent to calling cut() with the argument right = FALSE

# One exception is that ft_bucketizer() includes values on both boundaries for the upper-most bucket
# So ft_bucketizer() is also equivalent to setting include.lowest = TRUE when using cut()

# The final thing to note is that whereas cut() returns a factor, ft_bucketizer() returns a numeric vector, with values in the first bucket returned as zero, values in the second bucket returned as one, values in the third bucket returned as two, and so on
# If you want to work on the results in R, you need to explicitly convert to a factor
# This is a common code pattern: a_tibble %>% ft_bucketizer("x", "x_buckets", splits = splits) %>% collect() %>% mutate(x_buckets = factor(x_buckets, labels = labels)

# A Spark connection has been created for you as spark_conn
# A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl. decades is a numeric sequence of 1920, 1930, ..., 2020, and decade_labels is a text description of those decades

# track_metadata_tbl, decades, decade_labels have been pre-defined
track_metadata_tbl
decades
decade_labels

hotttnesss_over_time <- track_metadata_tbl %>%
  # Select artist_hotttnesss and year
  select(artist_hotttnesss, year) %>%
  # Convert year to numeric
  mutate(year=as.numeric(year)) %>%
  # Bucketize year to decade using decades vector
  ft_bucketizer("year", "decade", split=decades) %>%
  # Collect the result
  collect() %>%
  # Convert decade to factor using decade_labels
  mutate(decade=factor(decade, labels=decade_labels))

# Draw a boxplot of artist_hotttnesss by decade
ggplot(hotttnesss_over_time, aes(decade, artist_hotttnesss)) +
  geom_boxplot()  


# A special case of the previous transformation is to cut a continuous variable into buckets where the buckets are defined by quantiles of the variable
# A common use of this transformation is to analyze survey responses or review scores
# If you ask people to rate something from one to five stars, often the median response won't be three stars
# In this case, it can be useful to split their scores up by quantile
# For example, you can make five quintile groups by splitting at the 0th, 20th, 40th, 60th, 80th, and 100th percentiles

# The base-R way of doing this is cut() + quantile()
# The sparklyr equivalent uses the ft_quantile_discretizer() transformation
# This takes an n.buckets argument, which determines the number of buckets
# The base-R and sparklyr ways of calculating this are shown together
# As before, right = FALSE and include.lowest are set

# survey_response_group <- cut(
#   survey_score,
#   breaks = quantile(survey_score, c(0, 0.25, 0.5, 0.75, 1)),
#   labels = c("hate it", "dislike it", "like it", "love it"),
#   right  = FALSE,
#   include.lowest = TRUE
# )

# survey_data %>% ft_quantile_discretizer("survey_score", "survey_response_group", n.buckets = 4)
  
# As with ft_bucketizer(), the resulting bins are numbers, counting from zero
# If you want to work with them in R, explictly convert to a factor

# A Spark connection has been created for you as spark_conn
# A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl
# duration_labels is a character vector describing lengths of time.

# track_metadata_tbl, duration_labels have been pre-defined
track_metadata_tbl
duration_labels

familiarity_by_duration <- track_metadata_tbl %>%
  # Select duration and artist_familiarity
  select(duration, artist_familiarity) %>%
  # Bucketize duration
  ft_quantile_discretizer("duration", "duration_bin", n.buckets=5) %>%
  # Collect the result
  collect() %>%
  # Convert duration bin to factor
  mutate(duration_bin=factor(duration_bin, labels=duration_labels))

# Draw a boxplot of artist_familiarity by duration_bin
ggplot(familiarity_by_duration, aes(duration_bin, artist_familiarity)) +
  geom_boxplot()


# Common uses of text-mining include analyzing shopping reviews to ascertain purchasers' feeling about the product, or analyzing financial news to predict the sentiment regarding stock prices
# In order to analyze text data, common pre-processing steps are to convert the text to lower-case (see tolower()), and to split sentences into individual words.

# ft_tokenizer() performs both these steps
# Its usage takes the same pattern as the other transformations that you have seen, with no other arguments

# shop_reviews %>% ft_tokenizer("review_text", "review_words")

# Since the output can contain a different number of words in each row, output.col is a list column, where every element is a list of strings
# To analyze text data, it is usually preferable to have one word per row in the data
# The list-of-list-of-strings format can be transformed to a single character vector using unnest() from the tidyr package
# There is currently no method for unnesting data on Spark, so for now, you have to collect it to R before transforming it
# The code pattern to achieve this is as follows

# library(tidyr)
# text_data %>%
#   ft_tokenizer("sentences", "word") %>%
#   collect() %>%
#   mutate(word = lapply(word, as.character)) %>%
#   unnest(word)

# A Spark connection has been created for you as spark_conn
# A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl

# track_metadata_tbl has been pre-defined
track_metadata_tbl

title_text <- track_metadata_tbl %>%
  # Select artist_name, title
  select(artist_name, title) %>%
  # Tokenize title to words
  ft_tokenizer("title", "word") %>%
  # Collect the result
  collect() %>%
  # Flatten the word column 
  mutate(word = lapply(word, as.character)) %>% 
  # Unnest the list column
  unnest()


# Sentiment analysis essentially lets you assign a score or emotion to each word
# For example, in the AFINN lexicon, the word "outstanding" has a score of +5, since it is almost always used in a positive context. "grace" is a slightly positive word, and has a score of +1
# "fraud" is usually used in a negative context, and has a score of -4
# The AFINN scores dataset is returned by get_sentiments("afinn")
# For convenience, the unnested word data and the sentiment lexicon have been copied to Spark

# Typically, you want to compare the sentiment of several groups of data
# To do this, the code pattern is as follows
# text_data %>%
#   inner_join(sentiments, by = "word") %>%
#   group_by(some_group) %>%
#   summarize(positivity = sum(score))

# An inner join takes all the values from the first table, and looks for matches in the second table
# If it finds a match, it adds the data from the second table
# Unlike a left join, it will drop any rows where it doesn't find a match
# The principle is shown in this diagram

# Like left joins, inner joins are a type of mutating join, since they add columns to the first table
# See if you can guess which function to use for inner joins, and how to use it. (Hint: the usage is really similar to left_join(), anti_join(), and semi_join()

# A Spark connection has been created for you as spark_conn
# Tibbles attached to the title words and sentiment lexicon stored in Spark have been pre-defined as title_text_tbl and afinn_sentiments_tbl respectively

# title_text_tbl, afinn_sentiments_tbl have been pre-defined
title_text_tbl
afinn_sentiments_tbl

sentimental_artists <- title_text_tbl %>%
  # Inner join with sentiments on word field
  inner_join(afinn_sentiments_tbl, by="word") %>%
  # Group by artist
  group_by(artist_name) %>%
  # Summarize to get positivity
  summarize(positivity = sum(score))

sentimental_artists %>%
  # Arrange by ascending positivity
  arrange(positivity) %>%
  # Get top 5
  top_n(5)

sentimental_artists %>%
  # Arrange by descending positivity
  arrange(desc(positivity)) %>%
  # Get top 5
  top_n(5)


# ft_tokenizer() uses a simple technique to generate words by splitting text data on spaces
# For more advanced usage, you can use regular expressions to split the text data
# This is done via the ft_regex_tokenizer() function, which has the same usage as ft_tokenizer(), but with an extra pattern argument for the splitter
# a_tibble %>% ft_regex_tokenizer("x", "y", pattern = regex_pattern)

# The return value from ft_regex_tokenizer(), like ft_tokenizer(), is a list of lists of character vectors
# The dataset contains a field named artist_mbid that contains an ID for the artist on MusicBrainz, a music metadata encyclopedia website
# The IDs take the form of hexadecimal numbers split by hyphens, for example, 65b785d9-499f-48e6-9063-3a1fd1bd488d

# track_metadata_tbl has been pre-defined
track_metadata_tbl

track_metadata_tbl %>%
  # Select artist_mbid column
  select(artist_mbid) %>%
  # Split it by hyphens
  ft_regex_tokenizer("artist_mbid", "artist_mbid_chunks", pattern="-")


# So far in this chapter, you've explored some feature transformation functions from Spark's Mllib
# sparklyr also provides access to some functions making use of the Spark DataFrame API

# The dplyr way of sorting a tibble is to use arrange()
# You can also sort tibbles using Spark's DataFrame API using sdf_sort()
# This function takes a character vector of columns to sort on, and currently only sorting in ascending order is supported

# For example, to sort by column x, then (in the event of ties) by column y, then by column z, the following code compares the dplyr and Spark DataFrame approaches
# a_tibble %>% arrange(x, y, z)
# a_tibble %>% sdf_sort(c("x", "y", "z"))

# To see which method is faster, try using both arrange(), and sdf_sort()
# You can see how long your code takes to run by wrapping it in microbenchmark(), from the package of the same name
# Sometimes native methods are faster than the dplyr equivalent; sometimes it is the other way around
# Profile your code if you need to see where the slowness occurs.


# track_metadata_tbl has been pre-defined
track_metadata_tbl

# Compare timings of arrange() and sdf_sort()
microbenchmark(
  arranged = track_metadata_tbl %>%
    # Arrange by year, then artist_name, then release, then title
    arrange(year, artist_name, release, title) %>%
    # Collect the result
    collect(),
  sorted = track_metadata_tbl %>%
    # Sort by year, then artist_name, then release, then title
    sdf_sort(c("year", "artist_name", "release", "title")) %>%
    # Collect the result
    collect(),
  times = 5
)


# sparklyr has a function named sdf_schema() for exploring the columns of a tibble on the R side
# It's easy to call; and a little painful to deal with the return value
# sdf_schema(a_tibble)

# The return value is a list, and each element is a list with two elements, containing the name and data type of each column
# The exercise shows a data transformation to more easily view the data types

# Here is a comparison of how R data types map to Spark data types
# Other data types are not currently supported by sparklyr
# R type logical is Spark type BooleanType
# R type numeric is Spark type DoubleType
# R type integer is Spark type IntegerType
# R type character is Spark type StringType
# R type list is Spark type ArrayType

# track_metadata_tbl has been pre-defined
track_metadata_tbl

# Get the schema
(schema <- sdf_schema(track_metadata_tbl))

# Transform the schema
schema %>%
  lapply(function(x) do.call(data_frame, x)) %>%
  bind_rows()


# When you are working with a big dataset, you typically don't really need to work with all of it all the time
# Particularly at the start of your project, while you are experimenting wildly with what you want to do, you can often iterate more quickly by working on a smaller subset of the data
# sdf_sample() provides a convenient way to do this
# It takes a tibble, and the fraction of rows to return
# In this case, you want to sample without replacement
# To get a random sample of one tenth of your dataset, you would use the following code: a_tibble %>% sdf_sample(fraction = 0.1, replacement = FALSE)

# Since the results of the sampling are random, and you will likely want to reuse the shrunken dataset, it is common to use compute() to store the results as another Spark data frame

# To make the results reproducible, you can also set a random number seed via the seed argument
# Doing this means that you get the same random dataset every time you run your code
# It doesn't matter which number you use for the seed; just choose your favorite positive integer

# track_metadata_tbl has been pre-defined
track_metadata_tbl

track_metadata_tbl %>%
  # Sample the data without replacement
  sdf_sample(fraction = 0.01, replacement=FALSE, seed=20000229) %>%
  # Compute the result
  compute("sample_track_metadata")


# Most of the time, when you run a predictive model, you need to fit the model on one subset of your data (the "training" set), then test the model predictions against the rest of your data (the "testing" set)

# sdf_partition() provides a way of partitioning your data frame into training and testing sets
# Its usage is as follows: a_tibble %>% sdf_partition(training = 0.7, testing = 0.3)

# There are two things to note about the usage
# Firstly, if the partition values don't add up to one, they will be scaled so that they do
# So if you passed training = 0.35 and testing = 0.15, you'd get double what you asked for
# Secondly, you can use any set names that you like, and partition the data into more than two sets
# So the following is also valid: a_tibble %>% sdf_partition(a = 0.1, b = 0.2, c = 0.3, d = 0.4)

# The return value is a list of tibbles. you can access each one using the usual list indexing operators
# partitioned$a
# partitioned[["b"]]

# track_metadata_tbl has been pre-defined
track_metadata_tbl

partitioned <- track_metadata_tbl %>%
  # Partition into training and testing sets
  sdf_partition(training = 0.7, testing=0.3)

# Get the dimensions of the training set
dim(partitioned$training)

# Get the dimensions of the testing set
dim(partitioned$testing)

Chapter 4 - Case Study

Machine Learning on Spark:

  • Further use of the MLlib - gradient boost, random forest, etc.

Interview with Javier Luraschi and Kevin Ushey:

  • Notes on SparkR vs sparklyr

Example code includes:

# In the last chapter, you saw some of the feature transformation functionality of Spark Mllib
# If that library were a meal, the feature transformations would be a starter; the main course is a sumptuous selection of machine learning modeling functions!
# These functions all have names beginning with ml_, and have a similar signature
# They take a tibble, a string naming the response variable, a character vector naming features (input variables), and possibly some other model-specific arguments
# a_tibble %>% ml_some_model("response", c("a_feature", "another_feature"), some_other_args)

# Supported machine learning functions include linear regression and its variants, tree-based models (ml_decision_tree(), and a few others
# You can see the list of all the machine learning functions using ls() - ls("package:sparklyr", pattern = "^ml")

# Songs start out as an analogue thing: their sound is really a load of vibrations of air
# In order to analyze a song, you need to turn it into some meaningful numbers
# Tracks in the Million Song Dataset have twelve timbre measurements taken at regular time intervals throughout the song
# (Timbre is a measure of the perceived quality of a sound; you can use it to distinguish voices from string instruments from percussion instruments, for example.)

# In this chapter, you are going to try and predict the year a track was released, based upon its timbre
# That is, you are going to use these timbre measurements to generate features for the models
# (Recall that feature is machine learning terminology for an input variable in a model. They are often called explanatory variables in statistics.)

# The timbre data takes the form of a matrix, with rows representing the time points, and columns representing the different timbre measurements
# Thus all the timbre matrices have twelve columns, but the number of rows differs from song to song
# The mean of each column estimates the average of a timbre measurement over the whole song
# These can be used to generate twelve features for the model

# timbre has been pre-defined
timbre

# Calculate column means
(mean_timbre <- colMeans(timbre))


# CSV files are great for saving the contents of rectangular data objects (like R data.frames and Spark DataFrames) to disk
# The problem is that they are really slow to read and write, making them unusable for large datasets
# Parquet files provide a higher performance alternative
# As well as being used for Spark data, parquet files can be used with other tools in the Hadoop ecosystem, like Shark, Impala, Hive, and Pig

# Technically speaking, parquet file is a misnomer
# When you store data in parquet format, you actually get a whole directory worth of files
# The data is split across multiple .parquet files, allowing it to be easily stored on multiple machines, and there are some metadata files too, describing the contents of each column

# sparklyr can import parquet files using spark_read_parquet()
# This function takes a Spark connection, a string naming the Spark DataFrame that should be created, and a path to the parquet directory
# Note that this function will import the data directly into Spark, which is typically faster than importing the data into R, then using copy_to() to copy the data from R to Spark

# spark_read_parquet(sc, "a_dataset", "path/to/parquet/dir")

# A Spark connection has been created for you as spark_conn
# A string pointing to the parquet directory (on the file system where R is running) has been created for you as parquet_dir

# parquet_dir has been pre-defined
parquet_dir

# List the files in the parquet dir
filenames <- dir(parquet_dir, full.names=TRUE)

# Show the filenames and their sizes
data_frame(
  filename = basename(filenames),
  size_bytes = file.size(filenames)
)

# Import the data into Spark
timbre_tbl <- spark_read_parquet(spark_conn, "timbre", parquet_dir)


# The features to the models you are about to run are contained in the timbre dataset, but the response – the year – is contained in the track_metadata dataset
# Before you run the model, you are going to have to join these two datasets together
# In this case, there is a one to one matching of rows in the two datasets, so you need an inner join

# There is one more data cleaning task you need to do
# The year column contains integers, but Spark modeling functions require real numbers
# You need to convert the year column to numeric

# A Spark connection has been created for you as spark_conn
# Tibbles attached to the track metadata and timbre data stored in Spark have been pre-defined as track_metadata_tbl and timbre_tbl respectively

# track_metadata_tbl, timbre_tbl pre-defined
track_metadata_tbl
timbre_tbl

track_metadata_tbl %>%
  # Inner join to timbre_tbl
  inner_join(timbre_tbl, by="track_id") %>%
  # Convert year to numeric
  mutate(year = as.numeric(year))

# Before you can run any models, you need to partition your data into training and testing sets
# There's a complication with this dataset, which means you can't just call sdf_partition()
# The complication is that each track by a single artist ought to appear in the same set; your model will appear more accurate than it really is if tracks by an artist are used to train the model then appear in the testing set

# The trick to dealing with this is to partition only the artist IDs, then inner join those partitioned IDs to the original dataset
# Note that artist_id is more reliable than artist_name for partitioning, since some artists use variations on their name between tracks
# For example, Duke Ellington sometimes has an artist name of "Duke Ellington", but other times has an artist name of "Duke Ellington & His Orchestra", or one of several spelling variants

# A Spark connection has been created for you as spark_conn
# A tibble attached to the combined and filtered track metadata/timbre data stored in Spark has been pre-defined as track_data_tbl

# track_data_tbl has been pre-defined
track_data_tbl

training_testing_artist_ids <- track_data_tbl %>%
  # Select the artist ID
  select(artist_id) %>%
  # Get distinct rows
  distinct() %>%
  # Partition into training/testing sets
  sdf_partition(training = 0.7, testing = 0.3)

track_data_to_model_tbl <- track_data_tbl %>%
  # Inner join to training partition
  inner_join(training_testing_artist_ids$training, by="artist_id")

track_data_to_predict_tbl <- track_data_tbl %>%
  # Inner join to testing partition
  inner_join(training_testing_artist_ids$testing, by="artist_id")


# Gradient boosting is a technique to improve the performance of other models
# The idea is that you run a weak but easy to calculate model
# Then you replace the response values with the residuals from that model, and fit another model
# By "adding" the original response prediction model and the new residual prediction model, you get a more accurate model
# You can repeat this process over and over, running new models to predict the residuals of the previous models, and adding the results in
# With each iteration, the model becomes stronger and stronger.

# To give a more concrete example, sparklyr uses gradient boosted trees, which means gradient boosting with decision trees as the weak-but-easy-to-calculate model
# These can be used for both classification problems (where the response variable is categorical) and regression problems (where the response variable is continuous)
# In the regression case, as you'll be using here, the measure of how badly a point was fitted is the residual

# Decision trees are covered in more depth in the Supervised Learning in R: Classification, and Supervised Learning in R: Regression courses (coming Summer 2017)
# The latter course also covers gradient boosting

# To run a gradient boosted trees model in sparklyr, call ml_gradient_boosted_trees()
# Usage for this function was discussed in the first exercise of this chapter

# track_data_to_model_tbl has been pre-defined
track_data_to_model_tbl

feature_colnames <- track_data_to_model_tbl %>%
  # Get the column names
  colnames() %>%
  # Limit to the timbre columns
  str_subset(fixed("timbre"))

gradient_boosted_trees_model <- track_data_to_model_tbl %>%
  # Run the gradient boosted trees model
  ml_gradient_boosted_trees("year", feature_colnames)


# Once you've run your model, then the next step is to make a prediction with it
# sparklyr contains methods for the predict() function from base-R
# This means that you can make predictions from Spark models with the same syntax as you would use for predicting a linear regression
# predict() takes two arguments: a model, and some testing data - predict(a_model, testing_data)

# A common use case is to compare the predicted responses with the actual responses, which you can draw plots of in R
# The code pattern for preparing this data is as follows
# Note that currently adding a prediction column has to be done locally, so you must collect the results first
# predicted_vs_actual <- testing_data %>%
#   select(response) %>%
#   collect() %>%
#   mutate(predicted_response = predict(a_model, testing_data))

# A Spark connection has been created for you as spark_conn
# Tibbles attached to the training and testing datasets stored in Spark have been pre-defined as track_data_to_model_tbl and track_data_to_predict_tbl respectively
# The gradient boosted trees model has been pre-defined as gradient_boosted_trees_model

# training, testing sets & model are pre-defined
track_data_to_model_tbl
track_data_to_predict_tbl
gradient_boosted_trees_model

responses <- track_data_to_predict_tbl %>%
  # Select the year column
  select(year) %>%
  # Collect the results
  collect() %>%
  # Add in the predictions
  mutate(
    predicted_year = predict(
      gradient_boosted_trees_model,
      newdata=track_data_to_predict_tbl
    )
  )


# Now you have your model predictions, you might wonder "are they any good?"
# There are many plots that you can draw to diagnose the accuracy of your predictions; here you'll take a look at two common plots
# Firstly, it's nice to draw a scatterplot of the predicted response versus the actual response, to see how they compare
# Secondly, the residuals ought to be somewhere close to a normal distribution, so it's useful to draw a density plot of the residuals

# One slightly tricky thing here is that sparklyr doesn't yet support the residuals() function in all its machine learning models
# Consequently, you have to calculate the residuals yourself (predicted responses minus actual responses)

# A local tibble responses, containing predicted and actual years, has been pre-defined

# responses has been pre-defined
responses

# Draw a scatterplot of predicted vs. actual
ggplot(responses, aes(actual, predicted)) +
  # Add the points
  geom_point(alpha=0.1) +
  # Add a line at actual = predicted
  geom_abline(intercept=0, slope=1)

residuals <- responses %>%
  # Transmute response data to residuals
  transmute(residual = predicted - actual)

# Draw a density plot of residuals
ggplot(residuals, aes(residual)) +
    # Add a density curve
    geom_density() +
    # Add a vertical line through zero
    geom_vline(xintercept=0)



# Like gradient boosted trees, random forests are another form of ensemble model
# That is, they use lots of simpler models (decision trees, again) and combine them to make a single better model
# Rather than running the same model iteratively, random forests run lots of separate models in parallel, each on a randomly chosen subset of the data, with a randomly chosen subset of features
# Then the final decision tree makes predictions by aggregating the results from the individual models

# sparklyr's random forest function is called ml_random_forest()
# Its usage is exactly the same as ml_gradient_boosted_trees() (see the first exercise of this chapter for a reminder on syntax)

# A Spark connection has been created for you as spark_conn
# A tibble attached to the combined and filtered track metadata/timbre data stored in Spark has been pre-defined as track_data_to_model_tbl

# track_data_to_model_tbl has been pre-defined
track_data_to_model_tbl

# Get the timbre columns
feature_colnames <- track_data_to_model_tbl %>%
  colnames() %>%
  str_subset(fixed("timbre"))

# Run the random forest model
random_forest_model <- track_data_to_model_tbl %>%
  ml_random_forest("year", feature_colnames)


# training, testing sets & model are pre-defined
track_data_to_model_tbl
track_data_to_predict_tbl
random_forest_model

# Create a response vs. actual dataset
responses <- track_data_to_predict_tbl %>%
  select(year) %>%
  collect() %>%
  mutate(predicted_year = predict(random_forest_model, newdata=track_data_to_predict_tbl))


# both_responses has been pre-defined
both_responses

# Draw a scatterplot of predicted vs. actual
ggplot(both_responses, aes(actual, predicted, color=model)) +
  # Add a smoothed line
  geom_smooth() +
  # Add a line at actual = predicted
  geom_abline(intercept = 0, slope = 1)

# Create a tibble of residuals
residuals <- both_responses %>%
  mutate(residual = predicted - actual)

# Draw a density plot of residuals
ggplot(residuals, aes(residual, color=model)) +
    # Add a density curve
    geom_density() +
    # Add a vertical line through zero
    geom_vline(xintercept = 0)


# both_responses has been pre-defined
both_responses

# Create a residual sum of squares dataset
both_responses %>%
  mutate(residual = predicted - actual) %>%
  group_by(model) %>%
  summarize(rmse = sqrt(mean(residual ** 2)))

Building Web Applications in R with Shiny: Case Studies

Chapter 1 - Shiny Review

Introduction:

  • Course assumes some basic knowledge of Shiny
    • Will begin with basic review of Shiny concepts
    • Continues with developing applications, and learning new skills and best practices
  • The most basic outline of all Shiny applications includes library(shiny) followed by:
    • ui <- fluidPage() # The UI defines the appearance and user-controls for the app
    • server <- function(input, output) {} # The server is the brains of the app, implementing the logic
    • shinyApp(ui=ui, server=server) # Combine the UI and server, then run it
  • For basic text, it can be just added inside the fluidPage() function call
    • ui <- fluidPage(“hello there”)
    • The fluidPage() will accept an arbitrary number of arguments, all separated by commas
  • Shiny has functions for helping to format text
    • h1() for primary, strong() for bold, em() for italicized
    • fluidPage(h1(“Shiny”), “by”, em(“instructor”)) # will have Shiny as an h1-heading and instructor italicized
  • Shiny has multiple layout options, with Sidebar being the most common (controls on the left, results on the right)
    • sidebarLayout(sidebarPanel(“Things on the side), mainPanel(”Things for the main panel))

Inputs and outputs:

  • Inputs are ways for users to interact with the Shiny App - mouse, keyboard, drop-downs, check-boxes, etc.
    • Shiny has many built-in input functions, typically of the formal *Input(inputId=, label=, …)
    • The inputId= needs to be a string that is unique, while the label is the descriptive text that will be placed above the input object
  • Outputs are anything that the user will be able to see, such as plots, tables, and text
    • Add the plot in the appropriate area of the UI as plotOutput(outputId=), where outputId must be a unique string
    • Add commands to the server function (which has two list arguments, input and output) to create and/or update the output
  • There are three rules for building the output of any Shiny process
    • The code needs to be insider one of the render() functions, where is appropriate to the request - e.g., renderPlot for plots
    • Save object to output$
    • Use input$ if you want to access anything from the input list - updates occur automatically

Reactivity 101:

  • Shiny uses reactive programming, which means that any time a variable changes, then anything that depends on that variable will also be updated
    • y <- 5 ; x <- y + 1 ; y <- 10 # under reactive programming for x and y, then x will now be 11 and y will be 10
  • In Shiny, all input variables are assumed to be reactive, so anything that depends on the inputs will be using the most up-to-date values
    • Reactive variables can only be used inside reactive contexts (all render* functions are reactive contexts, but using the input$ commands outside a reactive context generates an error)
  • It is also possible to observe a reactive variable using observe({}), which is itself a reactive context
    • So, to know what input\(a is, can run observe({ print(input\)a) })
    • As such, observe can be a very valuable tool for debugging
    • The observe function will run EVERY time any of the reactives inside of it have been updated
  • Can also create a reactive variable using reactive({ })
    • x <- reactive({ input\(a + 1 }) # will create reactive variable x to be input\)a +1 and to update any time a updates
    • The new value needs to be called like a function - x() - if it is used any time later in the program

Example code includes:

# Load the shiny package
library(shiny)

# Define UI for the application
ui <- fluidPage(
  # Add the text "Shiny is fun"
  "Shiny is fun"
)

# Define the server logic
server <- function(input, output) {}

# Run the application
shinyApp(ui = ui, server = server)


# Shiny has many functions that can transform plain text into formatted text
# Simply place text inside the h1() function to create a primary header (e.g. a title), h2() for a secondary header, strong() to make text bold, em() to make text italicized, or any of the other formatting functions

# You can also intermingle plain text and formatted text as much as you'd like—just remember to separate all the elements with commas!

# Define UI for the application
ui <- fluidPage(
  # "DataCamp" as a primary header
  h1("DataCamp"),
  # "Shiny use cases course" as a secondary header
  h2("Shiny use cases course"),
  # "Shiny" in italics
  em("Shiny"),
  # "is fun" as bold text
  strong("is fun")
)

# Define the server logic
server <- function(input, output) {}

# Run the application
shinyApp(ui = ui, server = server)


# Layouts in Shiny are used to give your app some structure by placing elements in certain desired positions

# A sidebar layout, created with the sidebarLayout() function, provides a basic two-column structure with a smaller sidebar on the left and a larger main panel on the right

# The sidebar layout function takes two arguments: sidebarPanel() and mainPanel()
# Each of these panels can contain any arbitrary mix of text/HTML elements, in a similar fashion to how you can mix these elements inside a fluidPage()

cars <- data.frame(speed=c(4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 16, 16, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 22, 23, 24, 24, 24, 24, 25), dist=c(2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34, 46, 26, 36, 60, 80, 20, 26, 54, 32, 40, 32, 40, 50, 42, 56, 76, 84, 36, 46, 68, 32, 48, 52, 56, 64, 66, 54, 70, 92, 93, 120, 85))
str(cars)

# Define UI for the application
ui <- fluidPage(
  # Add a sidebar layout to the application
  sidebarLayout(
    # Add a sidebar panel around the text and inputs
    sidebarPanel(
      h4("Plot parameters"),
      textInput("title", "Plot title", "Car speed vs distance to stop"),
      numericInput("num", "Number of cars to show", 30, 1, nrow(cars)),
      sliderInput("size", "Point size", 1, 5, 2, 0.5)
    ),
    # Add a main panel around the plot and table
    mainPanel(
      plotOutput("plot"),
      tableOutput("table")
    )
  )
)

# Define the server logic
server <- function(input, output) {
  output$plot <- renderPlot({
    plot(cars[1:input$num, ], main = input$title, cex = input$size)
  })
  output$table <- renderTable({
    cars[1:input$num, ]
  })
}

# Run the application
shinyApp(ui = ui, server = server)


# Inputs are Shiny's way of allowing users to interact with an app. For example, textInput() is used to let the user enter text and numericInput() lets the user select a number
# In the next chapter we will see many other types of inputs

# To add an input to your app, simply add the input function inside fluidPage()
# Recall from the video that all input functions have the same first two arguments: inputId and label

# Define UI for the application
ui <- fluidPage(
  # Create a numeric input with ID "age" and label of
  # "How old are you?"
  numericInput(inputId="age", label="How old are you?", value = 20),
  
  # Create a text input with ID "name" and label of 
  # "What is your name?"
  textInput(inputId="name", label="What is your name?")
)

# Define the server logic
server <- function(input, output) {}

# Run the application
shinyApp(ui = ui, server = server)


# Outputs are any object that should be displayed to the user and is generated in R, such as a plot or a table

# To add an output to a Shiny app, the first thing you need to do is add a placeholder for the output that tells Shiny where to place the output

# There are several output placeholder functions provided by Shiny, one for each type of output
# For example, plotOutput() is for displaying plots, tableOutput() is for outputting tables, and textOutput() is for dynamic text

data(iris)
str(iris)

# Define UI for the application
ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      # Create a text input with an ID of "name"
      textInput(inputId="name", "What is your name?", "Dean"),
      numericInput("num", "Number of flowers to show data for",
                   10, 1, nrow(iris))
    ),
    mainPanel(
      # Add a placeholder for a text output with ID "greeting"
      textOutput(outputId="greeting"),
      # Add a placeholder for a plot with ID "cars_plot"
      plotOutput(outputId="cars_plot"),
      # Add a placeholder for a table with ID "iris_table"
      tableOutput(outputId="iris_table")
    )
  )
)

# Define the server logic
server <- function(input, output) {}

# Run the application
shinyApp(ui = ui, server = server)


# There are three rules to build an output in Shiny:
# Build the object with the appropriate render*() function.
# Save the result of the render function into the output list, which is a parameter of the server function. Specifically, save it into output$<outputId> in order to replace the output placeholder in the UI that has ID outputId
# If the output relies on any user-modified input values, you can access any of the inputs using the input parameter of the server function. Specifically, input$<inputId> will always return the current value of the input field that has ID inputId

# Define UI for the application
ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      textInput("name", "What is your name?", "Dean"),
      numericInput("num", "Number of flowers to show data for",
                   10, 1, nrow(iris))
    ),
    mainPanel(
      textOutput("greeting"),
      plotOutput("cars_plot"),
      tableOutput("iris_table")
    )
  )
)

# Define the server logic
server <- function(input, output) {
  # Create a plot of the "cars" dataset 
  output$cars_plot <- renderPlot({
    plot(cars)
  })
  
  # Render a text greeting as "Hello <name>"
  output$greeting <- renderText({
    paste("Hello", input$name)
  })
  
  # Show a table of the first n rows of the "iris" data
  output$iris_table <- renderTable({
    data <- iris[1:input$num, ]
    data
  })
}

# Run the application
shinyApp(ui = ui, server = server)


# Reactive values are special constructs in Shiny; they are not seen anywhere else in R programming
# As such, they cannot be used in just any R code, reactive values can only be accessed within a reactive context

# This is the reason why any variable that depends on a reactive value must be created using the reactive() function, otherwise you will get an error
# The shiny server itself is not a reactive context, but the reactive() function, the observe() function, and all render*() functions are

# You are provided with a Shiny app containing two numeric inputs, num1 and num2, and a text output. Your task is to:

ui <- fluidPage(
  numericInput("num1", "Number 1", 5),
  numericInput("num2", "Number 2", 10),
  textOutput("result")
)

server <- function(input, output) {
  # Calculate the sum of the inputs
  sum <- reactive({
    input$num1 + input$num2
  })

  # Calculate the average of the inputs
  average <- reactive({
    sum() / 2
  })
  
  output$result <- renderText({
    paste(
      # Print the calculated sum
      "The sum is", sum(),
      # Print the calculated average
      "and the average is", average()
    )
  })
}

shinyApp(ui, server)

Chapter 2 - Making good plots with Shiny

Gapminder dataset:

  • Shiny can be used to customize a plot, based on the interactive user interface for experimentation
  • The Gapminder data contains socioeconomic data for ~140 countries for 1952-2007
    • Each row is one country for one year, with continent, life expectancy, population, and GDP per capita

Adding simple inputs to modify a plot:

  • The course will use ggplot2 for plotting inside the renderPlot() functions
  • Goal will be to customize the title, size of the data points, and yes/no for adding a best fit line
    • textInput() for getting the user input for the title - character
    • numericInput() for getting the user input for the point sizes - integer or numeric
    • checkboxInput() for getting the user input for a yes/no or TRUE/FALSE or the like - boolean

More input types:

  • Can add sliderInput() to allow the user to add a number based on a slider
    • Similar to numericInput(), have min= and max= which are both required
    • Can set the value= to be a vector of two numbers such as (10, 15), which will then make it a range selector
  • Can add radioButtons(inputId=, label=, choices=, selected=) to allow the user to select one item from a selection
    • There is no value= argument for radiuoButtons()
    • Instead, the selected= argument is applied within the function
    • The choices= provides a vector of the vaues for the various radio buttons
  • Can add selectInput(inputId=, label=, choices=, selected=, multiple=) which is a variant of the drop-down functionality
    • Similar to the radioButtons() idea for arguments when multiple=FALSE
    • If multiple=TRUE, then the user can select 2+ choices from the drop-down
  • Radio buttons and drop downs can frequently be inter-changed, though with some cautions
    • Radio buttons are sub-optimal when there are many potential selections due to space-constraints
    • Radio buttons can be preferred because the user can always see all of their options at once
    • Drop downs are required if the user needs to be able to select 2+ items at once

Advanced features to improve your plot:

  • There is a library(colourpicker) that is available to help with color selection in Shiny apps
  • Can run colourInput(inputId=, label=, value=)
    • Users can select any possible colour, without any pre-defined limits
  • Output functions (the portion included in the UI) can also have arguments, though these are less common
    • For example, plotOutput() can have width= or height= or hover= or the like
    • See the documentation for all of the potential arguments that could be included
    • The width and height parameters can be helpful to get a good aspect ratio for the graph
  • There are many packages for interactive plots in R, including plotly
    • plotly::ggplotly() will turn any ggplot2 plot in to a plotly interactive plot
    • With plotly, can zoom and hover and the like
    • Plots that are generated with plotly cannot just be generated inside renderPlot()
  • When using a plotly plot, make a few key changes to the UI and server
    • plotlyOutput() in the UI
    • renderPlotly() in the server

Example code includes:

# To use the gapminder data in a Shiny app, you will often have to filter the dataset in order to retain only a subset of the rows
# You can use the subset() function for that.

# Load the gapminder package
library(gapminder)
library(shiny)

# Define UI for the application
ui <- fluidPage(
  "The population of France in 1972 was",
  textOutput("answer")
)

# Define the server function
server <- function(input, output) {
  output$answer <- renderText({
    # Determine the population of France in year 1972
    subset(gapminder, country=="France" & year==1972)$pop
  })
}

# Run the application
shinyApp(ui = ui, server = server)


# In Shiny, as soon as the user changes the value of any input, Shiny makes the current value of that input immediately available to you in the server through the input argument of the server function
# You can retrieve the value of any input using input$<inputId>

# In order to assign a default initial value to a text input, the value argument is used

# The given Shiny app plots the GDP per capita vs life expectancy of countries in the gapminder dataset
# Your task is to add a text input that lets users change the title of the plot. Specifically:

# Load the ggplot2 package for plotting
# library(ggplot2)

# Define UI for the application
ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      # Add a title text input
      textInput(inputId="title", label="Title", value="GDP vs life exp")
    ),
    mainPanel(
      plotOutput("plot")
    )
  )
)

# Define the server logic
server <- function(input, output) {
  output$plot <- renderPlot({
    ggplot(gapminder, aes(gdpPercap, lifeExp)) +
      geom_point() +
      scale_x_log10() +
      # Use the input value as the plot's title
      ggtitle(input$title)
  })
}

# Run the application
shinyApp(ui = ui, server = server)


# Numeric inputs have a few more arguments that text inputs do not have, such as min and max, which define the minimum and maximum numbers that can be chosen

# Note that when the value of an input is accessed in the server code, Shiny is smart enough to know what type of input was used, and therefore what type of object it should return
# This means that if you have a numeric input with ID "foo", then input$foo will return a numeric value

# The code for the Shiny app from the last exercise is provided
# Your task is to add a numeric input that the user can use to change the size of the points on the plot

# Define UI for the application
ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      textInput("title", "Title", "GDP vs life exp"),
      # Add a size numeric input
      numericInput(inputId="size", label="Point size", value=1, min=1)
    ),
    mainPanel(
      plotOutput("plot")
    )
  )
)

# Define the server logic
server <- function(input, output) {
  output$plot <- renderPlot({
    ggplot(gapminder, aes(gdpPercap, lifeExp)) +
      # Use the size input as the plot point size
      geom_point(size = input$size) +
      scale_x_log10() +
      ggtitle(input$title)
  })
}

# Run the application
shinyApp(ui = ui, server = server)


# Unlike text and numeric inputs, checkbox inputs are limited to only two possible values: TRUE or FALSE
# When the user checks a checkbox input, the input has a value of TRUE, and if the box is unchecked then it returns FALSE

# Note that the value parameter of the checkboxInput() function, which defines the initial value, can only be set to either TRUE or FALSE

# The code for the Shiny app from the last exercise is provided with some modification
# The ggplot plot object inside renderPlot() is now assigned to a variable p

# Your task is to add a checkbox input that, when checked, will add a line of best fit to the plot

# Define UI for the application
ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      textInput("title", "Title", "GDP vs life exp"),
      numericInput("size", "Point size", 1, 1),
      # Add a checkbox for line of best fit
      checkboxInput(inputId="fit", label="Add line of best fit", value=FALSE)
    ),
    mainPanel(
      plotOutput("plot")
    )
  )
)

# Define the server logic
server <- function(input, output) {
  output$plot <- renderPlot({
    p <- ggplot(gapminder, aes(gdpPercap, lifeExp)) +
      geom_point(size = input$size) +
      scale_x_log10() +
      ggtitle(input$title)
    
    # When the "fit" checkbox is checked, add a line
    # of best fit
    if (input$fit) {
      p <- p + geom_smooth(method = "lm")
    }
    p
  })
}

# Run the application
shinyApp(ui = ui, server = server)


# Radio buttons are used when you want to present the user with several options and ask them to choose one
# They have a choices parameter that defines the different options the user can choose from, and a selected argument that defines which choice is selected initially
# Note that there is no value parameter, though you can think of selected as having a similar role

# The code for the Shiny app from the last exercise is provided
# Your task is to add radio buttons that give the user a choice of colour to use for the plot

# Define UI for the application
ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      textInput("title", "Title", "GDP vs life exp"),
      numericInput("size", "Point size", 1, 1),
      checkboxInput("fit", "Add line of best fit", FALSE),
      # Add radio buttons for colour
      radioButtons(inputId="colour", label="Point colour", choices=c("blue", "red", "green", "black"))
    ),
    mainPanel(
      plotOutput("plot")
    )
  )
)

# Define the server logic
server <- function(input, output) {
  output$plot <- renderPlot({
    p <- ggplot(gapminder, aes(gdpPercap, lifeExp)) +
      # Use the value of the colour input as the point colour
      geom_point(size = input$size, col = input$colour) +
      scale_x_log10() +
      ggtitle(input$title)
    
    if (input$fit) {
      p <- p + geom_smooth(method = "lm")
    }
    p
  })
}

# Run the application
shinyApp(ui = ui, server = server)


# When there are many options to let the user choose from, radio buttons can take up a lot of space and may not be ideal
# Select inputs—also called 'dropdown lists'—can also be used to ask the user to choose an option from a list of choices, but in a more compact way
# With a select input, all the options appear in a scrollable list, so it can be used even if you have many choices

# Similar to radio buttons, select inputs also have choices and selected parameters
# Additionally, select inputs have a multiple argument, which, when set to TRUE, allows the user to select more than one value

# The code for the Shiny app from the last exercise is provided with slight modifications

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      textInput("title", "Title", "GDP vs life exp"),
      numericInput("size", "Point size", 1, 1),
      checkboxInput("fit", "Add line of best fit", FALSE),
      radioButtons("colour", "Point colour",
                   choices = c("blue", "red", "green", "black")),
      # Add a continent dropdown selector
      selectInput(inputId="continents", label="Continents", choices=levels(gapminder$continent), selected="Europe", multiple=TRUE)
    ),
    mainPanel(
      plotOutput("plot")
    )
  )
)

# Define the server logic
server <- function(input, output) {
  output$plot <- renderPlot({
    # Subset the gapminder dataset by the chosen continents
    data <- subset(gapminder,
                   continent %in% input$continents)

    p <- ggplot(data, aes(gdpPercap, lifeExp)) +
      geom_point(size = input$size, col = input$colour) +
      scale_x_log10() +
      ggtitle(input$title)
    
    if (input$fit) {
      p <- p + geom_smooth(method = "lm")
    }
    p
  })
}

shinyApp(ui = ui, server = server)


# Slider inputs can be used for similar purposes to numeric inputs, as they both provide the user with a way to select a number

# If the initial provided value (the value argument) of the slider is a single number, then the slider will be used to select single numbers
# However, if the initial value is a vector of two numbers, then the slider will be used to select two numbers instead of just a single value

# We have already seen that different inputs may have different arguments
# It can be difficult to remember the exact arguments each input uses
# The only way to find out what arguments you can use with a specific input function is by looking at its documentation or help file

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      textInput("title", "Title", "GDP vs life exp"),
      numericInput("size", "Point size", 1, 1),
      checkboxInput("fit", "Add line of best fit", FALSE),
      radioButtons("colour", "Point colour",
                   choices = c("blue", "red", "green", "black")),
      selectInput("continents", "Continents",
                  choices = levels(gapminder$continent),
                  multiple = TRUE,
                  selected = "Europe"),
      # Add a slider selector for years to filter
      sliderInput(inputId="years", label="Years", min=min(gapminder$year), max=max(gapminder$year), value=c(1977, 2002))
    ),
    mainPanel(
      plotOutput("plot")
    )
  )
)

# Define the server logic
server <- function(input, output) {
  output$plot <- renderPlot({
    # Subset the gapminder data by the chosen years
    data <- subset(gapminder,
                   continent %in% input$continents &
                   year >= input$years[1] & year <= input$years[2])
    
    p <- ggplot(data, aes(gdpPercap, lifeExp)) +
      geom_point(size = input$size, col = input$colour) +
      scale_x_log10() +
      ggtitle(input$title)
    
    if (input$fit) {
      p <- p + geom_smooth(method = "lm")
    }
    p
  })
}

shinyApp(ui = ui, server = server)


# The colourpicker package provides a colour input, available through the colourInput() function
# Even though colour inputs are not part of the shiny package, they behave in the same way as any other input

# A colour input can have many different arguments you can explore, but we will only use the basic arguments: inputId, label, and value
# The value argument accepts a colour to use as the initial value
# Colours can be specified in several different formats, but the easiest one is to simply use English colour names such as "red" or "yellow"

# The code for the Shiny app from the last exercise is provided
# Your task is to replace the radio buttons that are used to select a colour with a colour input

# Load the colourpicker package
library(colourpicker)

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      textInput("title", "Title", "GDP vs life exp"),
      numericInput("size", "Point size", 1, 1),
      checkboxInput("fit", "Add line of best fit", FALSE),

      # Replace the radio buttons with a colour input
      colourInput("colour", "Point colour", value="blue"), 
      selectInput("continents", "Continents",
                  choices = levels(gapminder$continent),
                  multiple = TRUE,
                  selected = "Europe"),
      sliderInput("years", "Years",
                  min(gapminder$year), max(gapminder$year),
                  value = c(1977, 2002))
    ),
    mainPanel(
      plotOutput("plot")
    )
  )
)

# Define the server logic
server <- function(input, output) {
  output$plot <- renderPlot({
    data <- subset(gapminder,
                   continent %in% input$continents &
                   year >= input$years[1] & year <= input$years[2])
    
    p <- ggplot(data, aes(gdpPercap, lifeExp)) +
      geom_point(size = input$size, col = input$colour) +
      scale_x_log10() +
      ggtitle(input$title)
    
    if (input$fit) {
      p <- p + geom_smooth(method = "lm")
    }
    p
  })
}

shinyApp(ui = ui, server = server)


# Just as input functions can have different arguments depending on the type of input, so can output placeholder functions have different arguments to modify their appearance or behaviour

# For example, when displaying a plot in a Shiny app using plotOutput(), the height of the plot by default will be 400 pixels
# The plotOutput() function has some parameters that can be used to modify the height or width of a plot

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      textInput("title", "Title", "GDP vs life exp"),
      numericInput("size", "Point size", 1, 1),
      checkboxInput("fit", "Add line of best fit", FALSE),
      colourInput("colour", "Point colour", value = "blue"),
      selectInput("continents", "Continents",
                  choices = levels(gapminder$continent),
                  multiple = TRUE,
                  selected = "Europe"),
      sliderInput("years", "Years",
                  min(gapminder$year), max(gapminder$year),
                  value = c(1977, 2002))
    ),
    mainPanel(
      # Make the plot 600 pixels wide and 600 pixels tall
      plotOutput("plot", width=600, height=600)
    )
  )
)

# Define the server logic
server <- function(input, output) {
  output$plot <- renderPlot({
    data <- subset(gapminder,
                   continent %in% input$continents &
                     year >= input$years[1] & year <= input$years[2])
    
    p <- ggplot(data, aes(gdpPercap, lifeExp)) +
      geom_point(size = input$size, col = input$colour) +
      scale_x_log10() +
      ggtitle(input$title)
    
    if (input$fit) {
      p <- p + geom_smooth(method = "lm")
    }
    p
  })
}

shinyApp(ui = ui, server = server)


# plotly is a popular package for creating interactive plots in Shiny
# There are several other packages for interactive visualizations, but we will use plotly largely because of its function ggplotly(), which converts a ggplot2 plot into an interactive one

# The code for the Shiny app from the last exercise is provided. Your task is to replace the ggplot2 plot with a plotly plot

# Load the plotly package
# library(plotly)
# 
# ui <- fluidPage(
#   sidebarLayout(
#     sidebarPanel(
#       textInput("title", "Title", "GDP vs life exp"),
#       numericInput("size", "Point size", 1, 1),
#       checkboxInput("fit", "Add line of best fit", FALSE),
#       colourInput("colour", "Point colour", value = "blue"),
#       selectInput("continents", "Continents",
#                   choices = levels(gapminder$continent),
#                   multiple = TRUE,
#                   selected = "Europe"),
#       sliderInput("years", "Years",
#                   min(gapminder$year), max(gapminder$year),
#                   value = c(1977, 2002))
#     ),
#     mainPanel(
#       # Replace the `plotOutput()` with the plotly version
#       plotlyOutput("plot")
#     )
#   )
# )
# 
# # Define the server logic
# server <- function(input, output) {
#   # Replace the `renderPlot()` with the plotly version
#   output$plot <- renderPlotly({
#     # Convert the existing ggplot2 to a plotly plot
#     ggplotly({
#       data <- subset(gapminder,
#                      continent %in% input$continents &
#                        year >= input$years[1] & year <= input$years[2])
#       
#       p <- ggplot(data, aes(gdpPercap, lifeExp)) +
#         geom_point(size = input$size, col = input$colour) +
#         scale_x_log10() +
#         ggtitle(input$title)
#       
#       if (input$fit) {
#         p <- p + geom_smooth(method = "lm")
#       }
#       p
#     })
#   })
# }
# 
# shinyApp(ui = ui, server = server)

Chapter 3 - Explore datasets interactively with Shiny

Explore a dataset with Shiny:

  • Goal is to expand on interactivity by being able to explore, interact with, filter, and download the asociated data
  • Tables in Shiny are outputs, created using a two-step process
    • The UI needs a tableOutput()
    • The server needs a renderTable()
  • The choices= argument of selectInput() takes a list string, and the coder can control what is available
    • Could pass only a select subset of countries or years as options, preventing users from taking unplanned actions

More ways to view data: plot and download:

  • Plots are a common and valuable first step when viewing the data
  • Plots are outputs meaning that they need to be inside the UI and modified by way of the server function
    • plotOutput(outputId=)
    • renderPlot()
  • Downloading is supporting in Shiny by way of the download button - can be used to download an image, text file, CSV file, etc.
    • downloadButton(outputId=, label=) needs to be in the UI
    • downloadHandler(filename=, content=) needs to be in the server, with filename being a string and contect being a function

Reactive variables:

  • Code duplication can become annoying and hard to maintain; often, the same trasnformations are desired in many places
  • Using reactive() can be a very good solution to minimizing code duplication
    • my_data <- reactive({ })
    • Needs to then be called as my_data()
  • Reactive variables cache their variables, meaning they store their values and update only when one of their input components is updated
    • By calculating a reactive variable, run time can be better optimized through re-use
  • Reactive variables are also lazy, meaning that they are only run when their values are needed
    • If the reactive variable is only needed in the download function, then it will only update on download, not every time the user changes something in the reactive

Visual enhancements:

  • Aesthetic and ease-of-use are both core to the success of a Shiny App
  • Tables by default show everything, which may be OK for small tables but becomes cumbersome otherwise
  • The DT library allows for making better tables, largely as a “drop-in” replacement for existing commands
    • DT::dataTableOutput(outputId=)
    • DT::renderDataTable({ })
    • Allows for sorting by column, searching, pagination of long results, and the like
  • Breaking the UI into several tabs can be very helpful for a large number of inputs and outputs
    • tabPanel(title=, , )
    • All of the tabPanel() calls should be placed inside a tabsetPanel() function
  • CSS (cascading style sheets) is the mark-up language that can help improve the look of a Shiny app (which is just a html web page)
    • CSS is written as a collection of rules - #ID { property: value; property:value; } - the ID should match the element name (inputId or outputId) that you want to modify
    • The CSS code then goes inside a tags$style(“quotedCSSCall”)

Example code includes:

# An easy first step in exploring a dataset is to simply view it as a table

# So far we have focused mostly on inputs—interactive widgets that allow the user to select values
# Now we want to have a table in our app, and send data to display in the table
# To display objects in Shiny, we need to use output and render functions

# Given a minimal Shiny app, add a table that will show the gapminder dataset

ui <- fluidPage(
  h1("Gapminder"),
  # Add a placeholder for a table output
  tableOutput(outputId="table")
)

server <- function(input, output) {
  # Call the appropriate render function 
  output$table <- renderTable({
    # Show the gapminder object in the table
    gapminder
  })
}

shinyApp(ui, server)


# The real benefit of using Shiny comes when inputs are combined with outputs
# The table created in the last exercise is static—it cannot be changed—but for exploration, it would be better if the user could decide what subset of the data to see

# This can be achieved by adding an input that lets the user select a value to filter the data
# This way, the table we created in the previous exercise can be made dynamic

# One of the variables in the gapminder dataset is lifeExp (life expectancy)
# Your task is to add a slider input to the Shiny app that lets the user choose a minimum and maximum life expectancy, and the table will only show data that matches these values

ui <- fluidPage(
  h1("Gapminder"),
  # Add a slider for life expentancy filter 
  sliderInput(inputId = "life", label = "Life expectancy",
      min = 0, max = 120,
      value = c(30, 50)),
  tableOutput("table")
)

server <- function(input, output) {
  output$table <- renderTable({
    data <- gapminder
    data <- subset(
      data,
      # Use the life expentancy input to filter the data
      lifeExp >= input$life[1] & lifeExp <= input$life[2]
    )
    data
  })
}

shinyApp(ui, server)


# When exploring a dataset, it is often useful to experiment with filtering more than one variable
# For example, you might be interested in only seeing data for African countries that had a specific life expentancy

# Add a select input that allows the user to select a specific continent to view

ui <- fluidPage(
  h1("Gapminder"),
  sliderInput(inputId = "life", label = "Life expectancy",
              min = 0, max = 120,
              value = c(30, 50)),
  # Add a continent selector dropdown
  selectInput(inputId="continent", label="Continent", choices = levels(gapminder$continent)),
  tableOutput("table")
)

server <- function(input, output) {
  output$table <- renderTable({
    data <- gapminder
    data <- subset(
      data,
      lifeExp >= input$life[1] & lifeExp <= input$life[2]
    )
    data <- subset(
      data,
      # Filter the data according to the continent input value
      continent == input$continent
    )
    data
  })
}

shinyApp(ui, server)


# Before adding the continent selector, the Shiny app showed data for all continents
# Now that the continent selector was added, the data can be viewed per continent
# But what if the user decides they actually don't want to filter for a specific continent, and they prefer to see all of them?
# Unfortunately, adding the continent selector removed that ability

# The choices argument of the selectInput() function can be modified to add another value to the continent list, and when this value is chosen, continent filtering can be turned off

# Add an option in the select input to select "All" continents
# When that option is selected, do not perform any continent filtering

ui <- fluidPage(
  h1("Gapminder"),
  sliderInput(inputId = "life", label = "Life expectancy",
              min = 0, max = 120,
              value = c(30, 50)),
  # Add an "All" value to the continent list
  selectInput("continent", "Continent",
              choices = c("All", levels(gapminder$continent))),
  tableOutput("table")
)

server <- function(input, output) {
  output$table <- renderTable({
    data <- gapminder
    data <- subset(
      data,
      lifeExp >= input$life[1] & lifeExp <= input$life[2]
    )
    # Don't subset the data if "All" continent are chosen
    if (input$continent != "All") {
      data <- subset(
        data,
        continent == input$continent
      )
    }
    data
  })
}

shinyApp(ui, server)


# Recall that plots are output objects, and as such they are added to a Shiny app using the plotOutput() + renderPlot() functions
# The output function is added to the UI to determine where to place the plot, and the render function in the server code is responsible for generating the plot

# Your task is to add a plot of GDP per capita vs life expectancy to the app
# The data used in the plot should be the same data that is shown in the table; that is, the data in the plot should only show records that match the input filters

ui <- fluidPage(
  h1("Gapminder"),
  sliderInput(inputId = "life", label = "Life expectancy",
              min = 0, max = 120,
              value = c(30, 50)),
  selectInput("continent", "Continent",
              choices = c("All", levels(gapminder$continent))),
  # Add a plot output
  plotOutput("plot"),
  tableOutput("table")
)

server <- function(input, output) {
  output$table <- renderTable({
    data <- gapminder
    data <- subset(
      data,
      lifeExp >= input$life[1] & lifeExp <= input$life[2]
    )
    if (input$continent != "All") {
      data <- subset(
        data,
        continent == input$continent
      )
    }
    data
  })

  # Create the plot render function  
  output$plot <- renderPlot({
    # Use the same filtered data that the table uses
    data <- gapminder
    data <- subset(
      data,
      lifeExp >= input$life[1] & lifeExp <= input$life[2]
    )
    if (input$continent != "All") {
      data <- subset(
        data,
        continent == input$continent
      )
    }
    ggplot(data, aes(gdpPercap, lifeExp)) +
      geom_point() +
      scale_x_log10()
  })
}

shinyApp(ui, server)


# Downloading files is achieved using the pair of functions downloadButton() and downloadHandler()
# These two functions pair together similarly to how output and render functions are paired: downloadButton() determines where in the UI it will show up, while downloadHandler() needs to be saved into the output list and has the actual R code to create the downloaded file

# Add the ability to download the data that is currently viewed in the table as a CSV file

ui <- fluidPage(
  h1("Gapminder"),
  sliderInput(inputId = "life", label = "Life expectancy",
              min = 0, max = 120,
              value = c(30, 50)),
  selectInput("continent", "Continent",
              choices = c("All", levels(gapminder$continent))),
  # Add a download button
  downloadButton(outputId = "download_data", label = "Download"),
  plotOutput("plot"),
  tableOutput("table")
)

server <- function(input, output) {
  output$table <- renderTable({
    data <- gapminder
    data <- subset(
      data,
      lifeExp >= input$life[1] & lifeExp <= input$life[2]
    )
    if (input$continent != "All") {
      data <- subset(
        data,
        continent == input$continent
      )
    }
    data
  })

  # Create a download handler
  output$download_data <- downloadHandler(
    # The downloaded file is named "gapminder_data.csv"
    filename =  "gapminder_data.csv",
    content = function(file) {
      # The code for filtering the data is copied from the
      # renderTable() function
      data <- gapminder
      data <- subset(
        data,
        lifeExp >= input$life[1] & lifeExp <= input$life[2]
      )
      if (input$continent != "All") {
        data <- subset(
          data,
          continent == input$continent
        )
      }
      
      # Write the filtered data into a CSV file
      write.csv(data, file, row.names = FALSE)
    }
  )

  output$plot <- renderPlot({
    data <- gapminder
    data <- subset(
      data,
      lifeExp >= input$life[1] & lifeExp <= input$life[2]
    )
    if (input$continent != "All") {
      data <- subset(
        data,
        continent == input$continent
      )
    }
    ggplot(data, aes(gdpPercap, lifeExp)) +
      geom_point() +
      scale_x_log10()
  })
}

shinyApp(ui, server)


# In the previous exercises, the code to filter gapminder according to the input values is duplicated three times: once in the table, once in the plot, and once in the download handler

# Reactive variables can be used to reduce code duplication, which is generally a good idea because it makes maintenance easier

# The duplicated code chunks that filter the data have been removed
# Your task is to add a reactive variable that filters the data, and use this variable instead

ui <- fluidPage(
  h1("Gapminder"),
  sliderInput(inputId = "life", label = "Life expectancy",
              min = 0, max = 120,
              value = c(30, 50)),
  selectInput("continent", "Continent",
              choices = c("All", levels(gapminder$continent))),
  downloadButton(outputId = "download_data", label = "Download"),
  plotOutput("plot"),
  tableOutput("table")
)

server <- function(input, output) {
  # Create a reactive variable named "filtered_data"
  filtered_data <- reactive({
    # Filter the data (copied from previous exercise)
    data <- gapminder
    data <- subset(
      data,
      lifeExp >= input$life[1] & lifeExp <= input$life[2]
    )
    if (input$continent != "All") {
      data <- subset(
        data,
        continent == input$continent
      )
    }
    data
  })
  
  output$table <- renderTable({
    # Use the filtered_data variable to render the table output
    data <- filtered_data()
    data
  })

  output$download_data <- downloadHandler(
    filename = "gapminder_data.csv",
    content = function(file) {
      # Use the filtered_data variable to create the data for
      # the downloaded file
      data <- filtered_data()
      write.csv(data, file, row.names = FALSE)
    }
  )

  output$plot <- renderPlot({
    # Use the filtered_data variable to create the data for
    # the plot
    data <- filtered_data()
    ggplot(data, aes(gdpPercap, lifeExp)) +
      geom_point() +
      scale_x_log10()
  })
}

shinyApp(ui, server)


# Datatables from the DT package are often a better way to display data in a Shiny app when compared to the built-in tables
# Shiny tables can be converted to datatables with two simple code modifications: instead of using tableOutput() and renderTable(), you use DT::dataTableOutput() and DT::renderDataTable()
# Datatables have a wide variety of customization options, but we will not be using any special options

# Note that with the DT package, the convention is to not load the DT package, and instead use the DT:: prefix when calling the datatable functions

# The code for the Shiny app from the last coding exercise is provided without any modifications. Your task is to replace the basic Shiny table with a DT table

ui <- fluidPage(
  h1("Gapminder"),
  sliderInput(inputId = "life", label = "Life expectancy",
              min = 0, max = 120,
              value = c(30, 50)),
  selectInput("continent", "Continent",
              choices = c("All", levels(gapminder$continent))),
  downloadButton("download_data"),
  plotOutput("plot"),
  # Replace the tableOutput() with DT's version
  DT::dataTableOutput("table")
)

server <- function(input, output) {
  filtered_data <- reactive({
    data <- gapminder
    data <- subset(
      data,
      lifeExp >= input$life[1] & lifeExp <= input$life[2]
    )
    if (input$continent != "All") {
      data <- subset(
        data,
        continent == input$continent
      )
    }
    data
  })
  
  # Replace the renderTable() with DT's version
  output$table <- DT::renderDataTable({
    data <- filtered_data()
    data
  })

  output$download_data <- downloadHandler(
    filename = "gapminder_data.csv",
    content = function(file) {
      data <- filtered_data()
      write.csv(data, file, row.names = FALSE)
    }
  )

  output$plot <- renderPlot({
    data <- filtered_data()
    ggplot(data, aes(gdpPercap, lifeExp)) +
      geom_point() +
      scale_x_log10()
  })
}

shinyApp(ui, server)


# Tabs are useful when you have too much content and want to split it up
# To create a tab, you simply wrap UI elements in the tabPanel() function, and you need to supply a title for the tab using the title argument

# In order for tabs to appear in the UI, the tab panels need to be grouped into a tabset "container", by wrapping all the tab panels inside tabsetPanel()

# Your task is to add tabs to the Shiny app, such that the inputs and download button are in one tab, the plot is in another tab, and the table is in a third tab
# Since this is purely a visual change, all the code changes are to be done in the UI portion only

ui <- fluidPage(
    h1("Gapminder"),
    # Create a container for tab panels
    tabsetPanel(
        # Create an "Inputs" tab
        tabPanel(
            title = "Inputs",
            sliderInput(inputId = "life", label = "Life expectancy",
                        min = 0, max = 120,
                        value = c(30, 50)),
            selectInput("continent", "Continent",
                        choices = c("All", levels(gapminder$continent))),
            downloadButton("download_data")
        ),
        # Create a "Plot" tab
        tabPanel(
            title = "Plot",
            plotOutput("plot")
        ),
        # Create "Table" tab
        tabPanel(
            title = "Table",
            DT::dataTableOutput("table")
        )
    )
)

server <- function(input, output) {
  filtered_data <- reactive({
    data <- gapminder
    data <- subset(
      data,
      lifeExp >= input$life[1] & lifeExp <= input$life[2]
    )
    if (input$continent != "All") {
      data <- subset(
        data,
        continent == input$continent
      )
    }
    data
  })
  
  output$table <- DT::renderDataTable({
    data <- filtered_data()
    data
  })

  output$download_data <- downloadHandler(
    filename = "gapminder_data.csv",
    content = function(file) {
      data <- filtered_data()
      write.csv(data, file, row.names = FALSE)
    }
  )

  output$plot <- renderPlot({
    data <- filtered_data()
    ggplot(data, aes(gdpPercap, lifeExp)) +
      geom_point() +
      scale_x_log10()
  })
}

shinyApp(ui, server)


# CSS is an extremely popular markup language that is used to tell the browser how to display elements on a page
# You need to use CSS if you want to deviate from the default look-and-feel of Shiny and want to customize the appearance of different items in your app

# Recall that CSS is comprised of a set of rules, where each rule is a property: value pair associated with an element on the page
# It's possible to include CSS in your app by writing it in a separate file and importing it with includeCSS(), but in this course we will use the simpler approach of placing the CSS code inside tags$style() in the UI

my_css <- "
#download_data {
  /* Change the background colour of the download button
     to orange. */
  background: orange;

  /* Change the text size to 20 pixels. */
  font-size: 20px;
}

#table {
  /* Change the text colour of the table to red. */
  color: red;
}
"

ui <- fluidPage(
  h1("Gapminder"),
  tags$style(my_css),
  tabsetPanel(
    tabPanel(
      title = "Inputs",
      sliderInput(inputId = "life", label = "Life expectancy",
                  min = 0, max = 120,
                  value = c(30, 50)),
      selectInput("continent", "Continent",
                  choices = c("All", levels(gapminder$continent))),
      downloadButton("download_data")
    ),
    tabPanel(
      title = "Plot",
      plotOutput("plot")
    ),
    tabPanel(
      title = "Table",
      DT::dataTableOutput("table")
    )
  )
)

server <- function(input, output) {
  filtered_data <- reactive({
    data <- gapminder
    data <- subset(
      data,
      lifeExp >= input$life[1] & lifeExp <= input$life[2]
    )
    if (input$continent != "All") {
      data <- subset(
        data,
        continent == input$continent
      )
    }
    data
  })
  
  output$table <- DT::renderDataTable({
    data <- filtered_data()
    data
  })

  output$download_data <- downloadHandler(
    filename = "gapminder_data.csv",
    content = function(file) {
      data <- filtered_data()
      write.csv(data, file, row.names = FALSE)
    }
  )

  output$plot <- renderPlot({
    data <- filtered_data()
    ggplot(data, aes(gdpPercap, lifeExp)) +
      geom_point() +
      scale_x_log10()
  })
}

shinyApp(ui, server)

Chapter 4 - Word Clouds in Shiny

Word clouds in Shiny:

  • Word clouds are a visual representation of text, with size roughly proportional to frequency
  • Suppose that the function create_wordcloud(data, numwords=, background=) # background is a quoted color, data is the text as either a string or a vector of strings, numwords is the number of words to show
  • Objective is to use Shiny to take the create_wordcloud() functionality and envelope it in an easy-to-use Shiny App
  • Word clouds are a new type of output, requiring new functions for the UI and server
    • wordcloud2Output()
    • renderWordcloud2()

Adding word sources:

  • Can be helpful to allow the user to customize the data source
    • Given the function in this case, textInput() could theoretically work if the user just wants to enter a single line
    • Can instead use textAreaInput() which has a bigger space and a customizable scroll bar
  • Alternately, can allow users to upload a file to Shiny
    • fileInput(inputId=, label=, …)
    • Will lead to the Browse… pop-up and the user can pick 1+ files from their computer (can restrict the file types)
    • The file is then uploaded and becomes available to Shiny
    • The input$inputId will now be a data frame with 1 row per file - name, size, type, datapath (most important variable - location of file on Shiny server)
    • So, input\(inputId\)datapath will provide access to the file uploaded by the user

Combining all word sources:

  • Can allow the users to pick the way they want to enter text - directly, as a textArea, or by uploading a file
  • The radioButton(inputId=, label=, choices=, selected=)
    • Can create a named vector for choices=c(“showName1” = “useName1”, “showName2”=“useName2”, …) to allow users to see more intuitive text
  • Conditional panels can be used in the UI when you want to show/hide certain UI elements based on another value
    • conditionalPanel(condition, …)
    • The condition is similar to R code, but input$ becomes input.
    • The … are all the UI elements that you want to include subject to the condition
    • The condition statement is also quoted, so condition=“input.time_of_day == ‘Morning’” would be the check for whether input$time_of_day is ‘Morning’

Fine-tune reactivity:

  • The reactive() and input$ are reactive - code depending on reactive variables re-runs when dependencies update
  • Accessing a reactive value makes it a dependency
    • x <- reactive({ y() * input\(num1 }) # re-runs any time y or input\)num1 is updated
  • Isolating a chunk of code is used to NOT create a reactive dependency
    • If the reactive value inside reactive() is updated, then nothing happens
    • x <- reactive({ y() * isolate({ input\(num1 }) * input\)num2 }) # if input$num1 is changed, x will not be re-evaluated
    • x <- reactive({ y() * isolate({ input\(num1 * input\)num2 }) }) # if input\(num1 or input\)num2 is changed, x will not be re-evaluated
  • Sometimes, you may want to isolate all of the ractive variables, so that all calls are reactive({ isolate({ … }) })
    • The user is then the only one who can drive updates to the outputs, and by way of specified commands
  • The action button is set up as actionButton(inputId, label, …)
    • There is only one action with a button - clicking - and so the value of the button is simply the number of times it has been clicked
    • The button is then a very valuable trigger to reactivity by way of any call to the button’s value
    • x <- reactive({ calculate_x\(value ; isolate({ y() * input\)num1 * input$num2 }) })
    • if input\(num1 or input\)num2 or y() is changed, x will not be re-evaluated; but, if calculate_x changes (button clicked), then x is recalculated

Wrap up:

  • Plotting application - inputs and general Shiny capabilities for experimenting with plots
  • Data exploration application - outputs and reactive variables, plus tricks for ease-of-use
  • Word cloud application - exposing an R function in a GUI, isolation to mitigate reactivity

Examle code includes:

# You are provided with a sample dataset named artofwar, which contains the entire text of the Art of War book
# You can inspect the given Art of War text by running head(artofwar) or tail(artofwar) to see the first and last few verses of the book

# As mentioned in the video, since word clouds are not an output you saw before, they require a new pair of output and render functions: wordcloud2Output() and renderWordcloud2()
# These output functions are available from the wordcloud2 package

# The function create_wordcloud(), the dataset artofwar, and all the necessary packages are available in your workspace

fileCon <- file("./RInputFiles/artofwar.txt", "r")
fileLines <- readLines(fileCon)
close(fileCon)

length(fileLines)
modLines <- fileLines[1564:6982]
modLines <- modLines[modLines != "" & !grepl(pattern="^---", modLines)]
length(modLines)

artofwar <- modLines

library(tm)
library(wordcloud2)

create_wordcloud <- function(data, num_words = 100, background = "white") {
  # If text is provided, convert it to a dataframe of word frequencies
  if (is.character(data)) {
    corpus <- Corpus(VectorSource(data))
    corpus <- tm_map(corpus, tolower)
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    tdm <- as.matrix(TermDocumentMatrix(corpus))
    data <- sort(rowSums(tdm), decreasing = TRUE)
    data <- data.frame(word = names(data), freq = as.numeric(data))
  }
  # Make sure a proper num_words is provided
  if (!is.numeric(num_words) || num_words < 3) {
    num_words <- 3
  }  
  # Grab the top n most common words
  data <- head(data, n = num_words)
  if (nrow(data) == 0) {
    return(NULL)
  }
  wordcloud2(data, backgroundColor = background)
}

# Define UI for the application
ui <- fluidPage(
  h1("Word Cloud"),
  # Add the word cloud output placeholder to the UI
  wordcloud2Output(outputId = "cloud")
)

# Define the server logic
server <- function(input, output) {
  # Render the word cloud and assign it to the output list
  output$cloud <- renderWordcloud2({
    # Create a word cloud object
    create_wordcloud(artofwar)
  })
}

# Run the application
shinyApp(ui = ui, server = server)


# Recall that create_wordcloud() has two optional arguments: num_words, which is an integer specifying the maximum number of words to draw, and background, which specifies the background colour of the image

# Right now, the Shiny app simply outputs a word cloud with the exact same parameters all the time
# Since the word cloud generating function accepts these two parameters, it would be wasteful not to use them
# The parameters should be adjustable by the user using Shiny inputs

# You task is to add two inputs to the Shiny app, and use the values from these inputs as the num_words and background parameters of the word cloud

# All the required packages, including colourpicker, have been loaded to your workspace

library(colourpicker)

ui <- fluidPage(
  h1("Word Cloud"),
  # Add a numeric input for the number of words
  numericInput(inputId = "num", label = "Maximum number of words",
      value = 100, min = 5),
  # Add a colour input for the background colour
  colourInput(inputId="col", label="Background colour", value="white"),
  wordcloud2Output("cloud")
)

server <- function(input, output) {
  output$cloud <- renderWordcloud2({
    # Use the values from the two inputs as
    # parameters to the word cloud
    create_wordcloud(artofwar,
                     num_words = input$num, background = input$col)
  })
}

shinyApp(ui = ui, server = server)


# The app currently has very few objects (one title, two inputs, one word cloud output) so it is still manageable without a layout
# However, the app is going to grow in the next exercises and having a sidebar layout will be beneficial
# It's a good idea to add a layout to your app earlier rather than later, because placing new Shiny UI elements into an existing layout is easier than rearranging a larger non-structured app later

# As is commonly done with Shiny apps and other interactive applications, the inputs will be kept in the smaller sidebar, while the main output (the word cloud) will be in the larger main panel

# Your next task is to add a sidebar layout to the current Shiny app. No new UI elements are to be added other than the structure for the layout

ui <- fluidPage(
  h1("Word Cloud"),
  # Add a sidebar layout to the UI
  sidebarLayout(
    # Define a sidebar panel around the inputs
    sidebarPanel(
      numericInput("num", "Maximum number of words",
                   value = 100, min = 5),
      colourInput("col", "Background colour", value = "white")
    ),
    # Define a main panel around the output
    mainPanel(
      wordcloud2Output("cloud")
    )
  )
)

server <- function(input, output) {
  output$cloud <- renderWordcloud2({
    create_wordcloud(artofwar,
                     num_words = input$num, background = input$col)
  })
}

shinyApp(ui = ui, server = server)


# The textAreaInput() is useful when you want to allow the user to enter much longer text than what a typical textInput() allows
# Textareas span multiple rows and have a vertical scrollbar, as well as a rows parameter that can determine how many rows are visible

# Except for being larger, textarea inputs behave very similar to text inputs in every other way

ui <- fluidPage(
  h1("Word Cloud"),
  sidebarLayout(
    sidebarPanel(
      # Add a textarea input
      textAreaInput(inputId="text", label="Enter text", rows=7),
      numericInput("num", "Maximum number of words",
                   value = 100, min = 5),
      colourInput("col", "Background colour", value = "white")
    ),
    mainPanel(
      wordcloud2Output("cloud")
    )
  )
)

server <- function(input, output) {
  output$cloud <- renderWordcloud2({
    # Use the textarea's value as the word cloud data source
    create_wordcloud(input$text, num_words = input$num,
                     background = input$col)
  })
}

shinyApp(ui = ui, server = server)


# Rather than typing a long piece of text into a box, it can be more convenient to upload a text file if the text is extremely long

# Uploading files to a Shiny app is done using fileInput()

ui <- fluidPage(
  h1("Word Cloud"),
  sidebarLayout(
    sidebarPanel(
      textAreaInput("text", "Enter text", rows = 7),
      # Add a file input
      fileInput(inputId="file", label="Select a file"),
      numericInput("num", "Maximum number of words",
                   value = 100, min = 5),
      colourInput("col", "Background colour", value = "white")
    ),
    mainPanel(
      wordcloud2Output("cloud")
    )
  )
)

server <- function(input, output) {
  output$cloud <- renderWordcloud2({
    create_wordcloud(input$text, num_words = input$num,
                     background = input$col)
  })
}

shinyApp(ui = ui, server = server)


# After the user selects a file, that file gets uploaded to the computer that runs the Shiny app, and it becomes available in the server

# If the input ID of a file input is "myfile", then you might expect input$myfile to give you access to the file that was uploaded, but that is not how file inputs actually work
# input$myfile will return a data.frame that contains a few pieces of metadata about the selected file, with the main one to care about being datapath
# Assuming the file input's ID is "myfile", input$myfile$datapath will be the path where the file is located

# After getting the uploaded file's path (for example C:\Users\Dean\AppData\Local\Temp\path\to\file.txt), this path can be used to read the file in whatever way you need
# You may use read.csv() if the uploaded file is a CSV file, or readLines() if you simply want to read all the lines in the file, or any other function that accepts a file path

ui <- fluidPage(
  h1("Word Cloud"),
  sidebarLayout(
    sidebarPanel(
      textAreaInput("text", "Enter text", rows = 7),
      fileInput("file", "Select a file"),
      numericInput("num", "Maximum number of words",
                   value = 100, min = 5),
      colourInput("col", "Background colour", value = "white")
    ),
    mainPanel(
      wordcloud2Output("cloud")
    )
  )
)

server <- function(input, output) {
  # Define a reactive variable named `input_file`
  input_file <- reactive({
    if (is.null(input$file)) {
      return("")
    }
    # Read the text in the uploaded file
    readLines(input$file$datapath)
  })

  output$cloud <- renderWordcloud2({
    # Use the reactive variable as the word cloud data source
    create_wordcloud(input_file(), num_words = input$num,
                     background = input$col)
  })
}

shinyApp(ui = ui, server = server)


# Over the last few exercises, you've used 3 different sources for the word cloud: the Art of War book, a text field, and a text file
# However, only one source was working at any given time
# In this exercise, you will provide the user with a way to select which data source to use for the word cloud

# Your task is to add radio buttons to the app that will let the user select whether the word source should be the Art of War book, the textarea, or an uploaded file

ui <- fluidPage(
  h1("Word Cloud"),
  sidebarLayout(
    sidebarPanel(
      # Add radio buttons input
      radioButtons(
        inputId = "source",
        label = "Word source",
        choices = c(
          # First choice is "book", with "Art of War" displaying
          "Art of War" = "book",
          # Second choice is "own", with "Use your own words" displaying
          "Use your own words" = "own",
          # Third choice is "file", with "Upload a file" displaying
          "Upload a file" = "file"
        )
      ),
      textAreaInput("text", "Enter text", rows = 7),
      fileInput("file", "Select a file"),
      numericInput("num", "Maximum number of words",
                   value = 100, min = 5),
      colourInput("col", "Background colour", value = "white")
    ),
    mainPanel(
      wordcloud2Output("cloud")
    )
  )
)

server <- function(input, output) {
  input_file <- reactive({
    if (is.null(input$file)) {
      return("")
    }
    readLines(input$file$datapath)
  })

  output$cloud <- renderWordcloud2({
    create_wordcloud(input_file(), num_words = input$num,
                     background = input$col)
  })
}

shinyApp(ui = ui, server = server)


# When working with radio buttons, sometimes you need to use conditional logic (if-else statements) when accessing the radio button's value in the server
# This is necessary when different actions are performed depending on the exact choice, and the chosen value needs to be inspected before deciding how to proceed

# For example, with the radio buttons that select a data source, different code will need to run depending on which choice is selected

# Your next task is to use the appropriate data source in the word cloud function, according to what radio button the user chooses

ui <- fluidPage(
  h1("Word Cloud"),
  sidebarLayout(
    sidebarPanel(
      radioButtons(
        inputId = "source",
        label = "Word source",
        choices = c(
          "Art of War" = "book",
          "Use your own words" = "own",
          "Upload a file" = "file"
        )
      ),
      textAreaInput("text", "Enter text", rows = 7),
      fileInput("file", "Select a file"),
      numericInput("num", "Maximum number of words",
                   value = 100, min = 5),
      colourInput("col", "Background colour", value = "white")
    ),
    mainPanel(
      wordcloud2Output("cloud")
    )
  )
)

server <- function(input, output) {
  # Create a "data_source" reactive variable
  data_source <- reactive({
    # Return the appropriate data source depending on
    # the chosen radio button
    if (input$source == "book") {
      data <- artofwar
    } else if (input$source == "own") {
      data <- input$text
    } else if (input$source == "file") {
      data <- input_file()
    }
    return(data)
  })

  input_file <- reactive({
    if (is.null(input$file)) {
      return("")
    }
    readLines(input$file$datapath)
  })

  output$cloud <- renderWordcloud2({
    # Use the data_source reactive variable as the data
    # in the word cloud function
    create_wordcloud(data_source(), num_words = input$num,
                     background = input$col)
  })
}

shinyApp(ui = ui, server = server)


# The word cloud app now has three different ways to supply words to the word cloud
# Two of these methods involve a specific UI element that is only useful for them: there is a textarea that is only used when the user selects the "own" word source, and there is a file input that is only relevant when the user chooses the "file" source
# Ideally, only inputs that are needed would appear at any given moment

# The textarea has already been wrapped in a conditionalPanel() so that it will only appear when the user chooses to input their own text
# Your task is to conditionally show the file input only when the user selects the file upload as the data source

ui <- fluidPage(
  h1("Word Cloud"),
  sidebarLayout(
    sidebarPanel(
      radioButtons(
        inputId = "source",
        label = "Word source",
        choices = c(
          "Art of War" = "book",
          "Use your own words" = "own",
          "Upload a file" = "file"
        )
      ),
      conditionalPanel(
        condition = "input.source == 'own'",
        textAreaInput("text", "Enter text", rows = 7)
      ),
      # Wrap the file input in a conditional panel
      conditionalPanel(
        # The condition should be that the user selects
        # "file" from the radio buttons
        condition = "input.source == 'file'",
        fileInput("file", "Select a file")
      ),
      numericInput("num", "Maximum number of words",
                   value = 100, min = 5),
      colourInput("col", "Background colour", value = "white")
    ),
    mainPanel(
      wordcloud2Output("cloud")
    )
  )
)

server <- function(input, output) {
  data_source <- reactive({
    if (input$source == "book") {
      data <- artofwar
    } else if (input$source == "own") {
      data <- input$text
    } else if (input$source == "file") {
      data <- input_file()
    }
    return(data)
  })
  
  input_file <- reactive({
    if (is.null(input$file)) {
      return("")
    }
    readLines(input$file$datapath)
  })
  
  output$cloud <- renderWordcloud2({
    create_wordcloud(data_source(), num_words = input$num,
                        background = input$col)
  })
}

shinyApp(ui = ui, server = server)


# The word cloud app now has several different inputs, and modifying each one of them causes the word cloud to redraw with the new set of parameters, just as expected

# But this behaviour can also be annoying sometimes
# For example, when typing text in the textarea, the word cloud keeps regenerating without waiting for you to finish typing
# This can be controlled with isolate()

# All the code inside renderWordcloud2() that renders the word cloud has been removed
# Your task is to re-create the word cloud and isolate it so that changing the parameters will not automatically trigger a new word cloud

ui <- fluidPage(
  h1("Word Cloud"),
  sidebarLayout(
    sidebarPanel(
      radioButtons(
        inputId = "source",
        label = "Word source",
        choices = c(
          "Art of War" = "book",
          "Use your own words" = "own",
          "Upload a file" = "file"
        )
      ),
      conditionalPanel(
        condition = "input.source == 'own'",
        textAreaInput("text", "Enter text", rows = 7)
      ),
      conditionalPanel(
        condition = "input.source == 'file'",
        fileInput("file", "Select a file")
      ),
      numericInput("num", "Maximum number of words",
                   value = 100, min = 5),
      colourInput("col", "Background colour", value = "white")
    ),
    mainPanel(
      wordcloud2Output("cloud")
    )
  )
)

server <- function(input, output) {
  data_source <- reactive({
    if (input$source == "book") {
      data <- artofwar
    } else if (input$source == "own") {
      data <- input$text
    } else if (input$source == "file") {
      data <- input_file()
    }
    return(data)
  })
  
  input_file <- reactive({
    if (is.null(input$file)) {
      return("")
    }
    readLines(input$file$datapath)
  })
  
  output$cloud <- renderWordcloud2({
    # Isolate the code to render the word cloud so that it will
    # not automatically re-render on every parameter change
    isolate({
      # Render the word cloud using inputs and reactives
      create_wordcloud(data_source(), num_words=input$num, background=input$col)
    })
  })
}

shinyApp(ui = ui, server = server)


# After isolating the word cloud render code so that it wouldn't update too often, the last step is to provide a way to render the word cloud only when the user chooses to
# This can be achieved with the help of an actionButton()

ui <- fluidPage(
  h1("Word Cloud"),
  sidebarLayout(
    sidebarPanel(
      radioButtons(
        inputId = "source",
        label = "Word source",
        choices = c(
          "Art of War" = "book",
          "Use your own words" = "own",
          "Upload a file" = "file"
        )
      ),
      conditionalPanel(
        condition = "input.source == 'own'",
        textAreaInput("text", "Enter text", rows = 7)
      ),
      conditionalPanel(
        condition = "input.source == 'file'",
        fileInput("file", "Select a file")
      ),
      numericInput("num", "Maximum number of words",
                   value = 100, min = 5),
      colourInput("col", "Background colour", value = "white"),
      # Add a "draw" button to the app
      actionButton(inputId = "draw", label = "Draw!")
    ),
    mainPanel(
      wordcloud2Output("cloud")
    )
  )
)

server <- function(input, output) {
  data_source <- reactive({
    if (input$source == "book") {
      data <- artofwar
    } else if (input$source == "own") {
      data <- input$text
    } else if (input$source == "file") {
      data <- input_file()
    }
    return(data)
  })
  
  input_file <- reactive({
    if (is.null(input$file)) {
      return("")
    }
    readLines(input$file$datapath)
  })
  
  output$cloud <- renderWordcloud2({
    # Add the draw button as a dependency to
    # cause the word cloud to re-render on click
    input$draw
    isolate({
      create_wordcloud(data_source(), num_words = input$num,
                       background = input$col)
    })
  })
}

shinyApp(ui = ui, server = server)

Introduction to Statistics with R: Moderation and Mediation

Chapter 1 - Introduction to Moderation

Introduction to Moderation:

  • Stereotyping threats in sociological sciences have been shown to impact performance
  • Suppose that X = strereotype threat (test/control factor), Y=IQ Test, Z=Working Memory

Use of Moderation:

  • Moderator variables enhance the model when the relationship between X and Y varies with Z
  • Applies to correlational research as well as to regression research

Moderation Model:

  • Y = B0 + B1 * X1 + B2 * X2 + B3 * (X1*X2) + eps # this is the full moderation model where there is a different intercept and also a different slope for factor levels of X2
  • With categorical predictor variables, this becomes a little more complicated
    • Y = B0 + B1X1 + B2X2 + B3X3 + B4(X1X3) + B5(X2*X3) # this will assume that the goal is Y ~ X3, with moderation by (likely dummy) variables X1 and X2

Testing for Moderation:

  • If X and Z are continuous, can just add XZ as a term to the model and assess the regression coefficient (and significance) of XZ
  • With one categorical and one continuous variable, need to create dummy variables for the categorical variable and do a full multiplication of the dummies (n-1) and the continuous variable
    • Can look at the set of regression coefficients for the interaction terms of dummy with continuous
    • The ANOVA or NHST can then assess whether there is a statistically significant change in predictive power of the models

Example code includes:

# The example, used in this chapter and described in the video, is based on the idea of stereotype threat
# The independent variable is the experimental manipulation or the stereotype threat and the dependent variable is the IQ test score (iq)
# The variable working memory capacity (wm) is the moderator
# In this chapter you want to investigate how stereotype threat affects the IQ test scores with the idea that maybe working memory moderates that effect

# The experiment is conducted in the following way: - students completed a working memory test. - students completed an IQ test. - students are randomly assigned to one of three experimental conditions: an explicit threat, an implicit threat or no threat
# Each group consists of 50 students.

mod <- data.frame(stringsAsFactors=FALSE, 
                  subject=1:150, 
                  condition=factor(c(rep("control", 50), rep("threat1", 50), rep("threat2", 50))), 
                  iq=c(134, 121, 86, 74, 80, 105, 100, 121, 138, 104, 77, 101, 128, 103, 87, 81, 71, 81, 100, 106, 90, 79, 102, 99, 46, 78, 82, 106, 123, 106, 97, 83, 133, 72, 93, 117, 112, 112, 97, 114, 64, 95, 126, 141, 73, 81, 68, 108, 106, 93, 36, 56, 44, 41, 80, 58, 68, 54, 64, 57, 32, 68, 46, 76, 35, 51, 55, 68, 41, 50, 42, 55, 37, 54, 59, 43, 37, 48, 75, 77, 83, 59, 41, 60, 59, 44, 55, 49, 21, 23, 51, 44, 68, 48, 47, 55, 44, 59, 53, 38, 39, 59, 36, 57, 57, 49, 52, 60, 47, 37, 47, 48, 36, 38, 46, 73, 63, 73, 47, 55, 45, 32, 35, 37, 50, 41, 38, 64, 59, 50, 54, 53, 30, 43, 79, 57, 34, 54, 42, 68, 60, 30, 29, 40, 52, 31, 44, 28, 42, 61), 
                  wm=c(91, 145, 118, 105, 96, 133, 99, 97, 96, 105, 86, 101, 113, 98, 86, 83, 116, 136, 117, 71, 90, 100, 88, 81, 90, 110, 105, 84, 115, 72, 110, 77, 112, 90, 82, 85, 101, 117, 80, 96, 96, 110, 113, 135, 118, 159, 97, 106, 79, 119, 82, 109, 73, 108, 126, 99, 129, 85, 119, 94, 86, 126, 97, 98, 84, 131, 103, 138, 95, 88, 91, 96, 92, 93, 106, 91, 85, 104, 138, 116, 120, 106, 89, 112, 130, 72, 111, 78, 77, 78, 111, 100, 108, 96, 107, 99, 94, 93, 91, 86, 69, 86, 75, 110, 93, 95, 102, 114, 84, 79, 97, 70, 76, 90, 55, 121, 114, 129, 101, 119, 88, 94, 92, 65, 85, 87, 75, 120, 103, 91, 90, 84, 61, 92, 122, 109, 86, 114, 75, 133, 129, 103, 103, 77, 119, 93, 99, 70, 90, 85), 
                  d1=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 
                  d2=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
                  )

mod$WM.centered <- mod$wm - mean(mod$wm)
str(mod)
## 'data.frame':    150 obs. of  7 variables:
##  $ subject    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ condition  : Factor w/ 3 levels "control","threat1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ iq         : num  134 121 86 74 80 105 100 121 138 104 ...
##  $ wm         : num  91 145 118 105 96 133 99 97 96 105 ...
##  $ d1         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ d2         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ WM.centered: num  -8.08 45.92 18.92 5.92 -3.08 ...
summary(mod)
##     subject         condition        iq               wm        
##  Min.   :  1.00   control:50   Min.   : 21.00   Min.   : 55.00  
##  1st Qu.: 38.25   threat1:50   1st Qu.: 44.25   1st Qu.: 86.00  
##  Median : 75.50   threat2:50   Median : 58.50   Median : 96.50  
##  Mean   : 75.50                Mean   : 66.02   Mean   : 99.08  
##  3rd Qu.:112.75                3rd Qu.: 81.00   3rd Qu.:111.00  
##  Max.   :150.00                Max.   :141.00   Max.   :159.00  
##        d1               d2          WM.centered    
##  Min.   :0.0000   Min.   :0.0000   Min.   :-44.08  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:-13.08  
##  Median :0.0000   Median :0.0000   Median : -2.58  
##  Mean   :0.3333   Mean   :0.3333   Mean   :  0.00  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 11.92  
##  Max.   :1.0000   Max.   :1.0000   Max.   : 59.92
# Summary statistics
psych::describeBy(mod, mod$condition)
## 
##  Descriptive statistics by group 
## group: control
##             vars  n   mean    sd median trimmed   mad    min    max range
## subject        1 50  25.50 14.58  25.50   25.50 18.53   1.00  50.00    49
## condition*     2 50   1.00  0.00   1.00    1.00  0.00   1.00   1.00     0
## iq             3 50  97.88 20.93  99.50   97.47 25.20  46.00 141.00    95
## wm             4 50 102.18 18.79  99.50  100.55 20.02  71.00 159.00    88
## d1             5 50   0.00  0.00   0.00    0.00  0.00   0.00   0.00     0
## d2             6 50   0.00  0.00   0.00    0.00  0.00   0.00   0.00     0
## WM.centered    7 50   3.10 18.79   0.42    1.47 20.02 -28.08  59.92    88
##             skew kurtosis   se
## subject     0.00    -1.27 2.06
## condition*   NaN      NaN 0.00
## iq          0.04    -0.50 2.96
## wm          0.73     0.41 2.66
## d1           NaN      NaN 0.00
## d2           NaN      NaN 0.00
## WM.centered 0.73     0.41 2.66
## -------------------------------------------------------- 
## group: threat1
##             vars  n   mean    sd median trimmed   mad    min    max range
## subject        1 50  75.50 14.58  75.50   75.50 18.53  51.00 100.00    49
## condition*     2 50   2.00  0.00   2.00    2.00  0.00   2.00   2.00     0
## iq             3 50  52.16 13.79  52.00   51.75 11.86  21.00  83.00    62
## wm             4 50 100.80 16.85  97.50   99.90 16.31  72.00 138.00    66
## d1             5 50   1.00  0.00   1.00    1.00  0.00   1.00   1.00     0
## d2             6 50   0.00  0.00   0.00    0.00  0.00   0.00   0.00     0
## WM.centered    7 50   1.72 16.85  -1.58    0.82 16.31 -27.08  38.92    66
##             skew kurtosis   se
## subject     0.00    -1.27 2.06
## condition*   NaN      NaN 0.00
## iq          0.16    -0.25 1.95
## wm          0.48    -0.56 2.38
## d1           NaN      NaN 0.00
## d2           NaN      NaN 0.00
## WM.centered 0.48    -0.56 2.38
## -------------------------------------------------------- 
## group: threat2
##             vars  n   mean    sd median trimmed   mad    min    max range
## subject        1 50 125.50 14.58 125.50  125.50 18.53 101.00 150.00    49
## condition*     2 50   3.00  0.00   3.00    3.00  0.00   3.00   3.00     0
## iq             3 50  48.02 12.45  47.00   47.40 14.83  28.00  79.00    51
## wm             4 50  94.26 18.77  92.00   93.97 17.79  55.00 133.00    78
## d1             5 50   0.00  0.00   0.00    0.00  0.00   0.00   0.00     0
## d2             6 50   1.00  0.00   1.00    1.00  0.00   1.00   1.00     0
## WM.centered    7 50  -4.82 18.77  -7.08   -5.10 17.79 -44.08  33.92    78
##             skew kurtosis   se
## subject     0.00    -1.27 2.06
## condition*   NaN      NaN 0.00
## iq          0.38    -0.56 1.76
## wm          0.16    -0.71 2.65
## d1           NaN      NaN 0.00
## d2           NaN      NaN 0.00
## WM.centered 0.16    -0.71 2.65
# Create a boxplot of the data
boxplot(iq ~ condition, data=mod, main="Boxplot", xlab="Group Condition", ylab='IQ')

# A moderator predicts that the correlation between the predictor and the outcome will change in function of the group
# So if a clear change is observed in the correlations as a function of group, then this can indicate a significant moderation effect

# In this exercise, you have to calculate the correlations between the IQ scores iq and the working memory capacity wm by group

# Make the subset for the group condition = "control"
mod_control <- subset(mod, condition=="control")

# Make the subset for the group condition = "threat1"
mod_threat1 <- subset(mod, condition=="threat1")

# Make the subset for the group condition = "threat2"
mod_threat2 <- subset(mod, condition=="threat2")

# Calculate the correlations
cor(mod_control$iq, mod_control$wm)
## [1] 0.1079827
cor(mod_threat1$iq, mod_threat1$wm)
## [1] 0.7231095
cor(mod_threat2$iq, mod_threat2$wm)
## [1] 0.6772917
# To perform a moderation analysis, one needs to consider two models: one without moderation and one with moderation

# Construct a model without any moderation, model_1, that represents the relationship between working memory capacity wm and IQ scores iq and the effect of stereotype threat
# Recall that stereotype threat is a categorical predictor so it has to be dummy coded
# The dummy variables are already included in the dataset and are called d1 and d2
# They are representing respectively categories threat1 and threat2
# Notice that you use the control group as reference group

# Model without moderation (tests for "first-order effects")
model_1 <- lm(mod$iq ~ mod$wm + mod$d1 + mod$d2)

# Make a summary of model_1
summary(model_1)
## 
## Call:
## lm(formula = mod$iq ~ mod$wm + mod$d1 + mod$d2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.339  -7.294   0.744   7.608  42.424 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59.78635    7.14360   8.369 4.30e-14 ***
## mod$wm        0.37281    0.06688   5.575 1.16e-07 ***
## mod$d1      -45.20552    2.94638 -15.343  < 2e-16 ***
## mod$d2      -46.90735    2.99218 -15.677  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.72 on 146 degrees of freedom
## Multiple R-squared:  0.7246, Adjusted R-squared:  0.719 
## F-statistic: 128.1 on 3 and 146 DF,  p-value: < 2.2e-16
# Create new predictor variables
wm_d1 <- mod$wm * mod$d1
wm_d2 <- mod$wm * mod$d2

# Model with moderation
model_2 <- lm(mod$iq ~ mod$wm + mod$d1 + mod$d2 + wm_d1 + wm_d2)

# Make a summary of model_2
summary(model_2)
## 
## Call:
## lm(formula = mod$iq ~ mod$wm + mod$d1 + mod$d2 + wm_d1 + wm_d2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.414  -7.181   0.420   8.196  40.864 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  85.5851    11.3576   7.535 4.95e-12 ***
## mod$wm        0.1203     0.1094   1.100  0.27303    
## mod$d1      -93.0952    16.8573  -5.523 1.52e-07 ***
## mod$d2      -79.8970    15.4772  -5.162 7.96e-07 ***
## wm_d1         0.4716     0.1638   2.880  0.00459 ** 
## wm_d2         0.3288     0.1547   2.125  0.03529 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.38 on 144 degrees of freedom
## Multiple R-squared:  0.7409, Adjusted R-squared:  0.7319 
## F-statistic: 82.35 on 5 and 144 DF,  p-value: < 2.2e-16
# A model comparison might be performed formally by comparing model_1 versus model_2
# The significance of a null hypothesis is tested.

# Compare model_1 and model_2
anova(model_1, model_2)
## Analysis of Variance Table
## 
## Model 1: mod$iq ~ mod$wm + mod$d1 + mod$d2
## Model 2: mod$iq ~ mod$wm + mod$d1 + mod$d2 + wm_d1 + wm_d2
##   Res.Df   RSS Df Sum of Sq      F  Pr(>F)  
## 1    146 31655                              
## 2    144 29784  2    1871.3 4.5238 0.01243 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Choose colors to represent the points by group
color <- c("red","green","blue")

# Illustration of the first-order effects of working memory on IQ
ggplot(mod, aes(x = wm, y = iq)) + geom_smooth(method = "lm", color = "black") +
  geom_point(aes(color = condition))

# Illustration of the moderation effect of working memory on IQ
ggplot(mod, aes(x = wm, y = iq)) +
  geom_smooth(aes(group = condition), method = "lm", se = T, color = "black", fullrange = T) +
  geom_point(aes(color = condition))


Chapter 2 - Introduction to Centering Predictors

Introduction to Centering:

  • Centering is simply taking each predictor variable and subtracting the mean for each predictor variable (so that the new means are zero)
  • Centering has benefits for both conceptual (interpretation) and mathematical (statistics) reasons

Conceptual Reasons for Centering:

  • The intercept is difficult to interpret if the X-variables being zero makes no sense
    • For example, if Child Vocabulary ~ Mom Vocabulary + Age, Mom Vocabulary=0 and Age=0 make no sense
    • With centered predictors, on the other hand, the intercept would now be Child Vocabulary when the Mom has an average Vocabulary and the child has an average Age
  • With moderation, a single slope between X and Y does not exist, since it changes with respect to moderator variable Z
    • The single slope is most meaningful (accurate) in the region where all the predictor variables are at their means

Statistical Reason:

  • Without centering, X*Z can become highly correlated with X and Z just based on magnitudes
    • This introduces the multicollinearity problem for the coefficients - including, potentially, the matrix maths crashing out

Summary:

  • Centering predictors is simple and has both conceptual (interpretation) and statistical (multicollinearity) advantages

Example code includes:

# The example used in this chapter is the same as in the previous one
# Recall that the independent variable is the experimental manipulation, so the stereotype threat, and the dependent variable is the IQ test score (iq)
# The variable working memory capacity (wm) is the moderator
# Furthermore, remember that you want to investigate how stereotype threat affects the IQ test scores with the idea that maybe working memory moderates that effect

# Now you will look at the basic concepts of centering predictors
# The data set mod, which is loaded into your workspace, already contains the centered data of the variable wm, namely wm.centered

mod <- mod %>% rename(wm.centered=WM.centered)
str(mod)
## 'data.frame':    150 obs. of  7 variables:
##  $ subject    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ condition  : Factor w/ 3 levels "control","threat1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ iq         : num  134 121 86 74 80 105 100 121 138 104 ...
##  $ wm         : num  91 145 118 105 96 133 99 97 96 105 ...
##  $ d1         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ d2         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ wm.centered: num  -8.08 45.92 18.92 5.92 -3.08 ...
# Define wm_center
wm_center <- mod$wm - mean(mod$wm)

# Compare with the variable wm.centered
all.equal(wm_center, mod$wm.centered)
## [1] TRUE
# Model without moderation and with centered data
model_1_centered <- lm(mod$iq ~ mod$wm.centered + mod$d1 + mod$d2)

# Make a summary of model_1_centered
summary(model_1_centered)
## 
## Call:
## lm(formula = mod$iq ~ mod$wm.centered + mod$d1 + mod$d2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.339  -7.294   0.744   7.608  42.424 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      96.72429    2.09267  46.220  < 2e-16 ***
## mod$wm.centered   0.37281    0.06688   5.575 1.16e-07 ***
## mod$d1          -45.20552    2.94638 -15.343  < 2e-16 ***
## mod$d2          -46.90735    2.99218 -15.677  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.72 on 146 degrees of freedom
## Multiple R-squared:  0.7246, Adjusted R-squared:  0.719 
## F-statistic: 128.1 on 3 and 146 DF,  p-value: < 2.2e-16
# In the previous chapter you set up a model (model_2) that represents the relationship between working memory capacity (wm) and IQ scores (iq) and the effect of stereotype threat with moderation
# The model (model_2) is loaded in the workspace and might be used for comparison with models based on centered data

# To see what happens when you use centered data, you first have to make a model that represents the same as model_2 but uses centered data instead

# Create new predictor variables
wm_d1_centered <- mod$wm.centered * mod$d1
wm_d2_centered <- mod$wm.centered * mod$d2

# Define model_2_centered
model_2_centered <- lm(mod$iq ~ mod$wm.centered + mod$d1 + mod$d2 + wm_d1_centered + wm_d2_centered)

# Make a summary of model_2_centered
summary(model_2_centered)
## 
## Call:
## lm(formula = mod$iq ~ mod$wm.centered + mod$d1 + mod$d2 + wm_d1_centered + 
##     wm_d2_centered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.414  -7.181   0.420   8.196  40.864 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      97.5070     2.0619  47.289  < 2e-16 ***
## mod$wm.centered   0.1203     0.1094   1.100  0.27303    
## mod$d1          -46.3652     2.9038 -15.967  < 2e-16 ***
## mod$d2          -47.3223     2.9439 -16.075  < 2e-16 ***
## wm_d1_centered    0.4716     0.1638   2.880  0.00459 ** 
## wm_d2_centered    0.3288     0.1547   2.125  0.03529 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.38 on 144 degrees of freedom
## Multiple R-squared:  0.7409, Adjusted R-squared:  0.7319 
## F-statistic: 82.35 on 5 and 144 DF,  p-value: < 2.2e-16
# Compare model_1_centered and model_2_centered
anova(model_1_centered, model_2_centered)
## Analysis of Variance Table
## 
## Model 1: mod$iq ~ mod$wm.centered + mod$d1 + mod$d2
## Model 2: mod$iq ~ mod$wm.centered + mod$d1 + mod$d2 + wm_d1_centered + 
##     wm_d2_centered
##   Res.Df   RSS Df Sum of Sq      F  Pr(>F)  
## 1    146 31655                              
## 2    144 29784  2    1871.3 4.5238 0.01243 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare model_1 and model_2
anova(model_1, model_2)
## Analysis of Variance Table
## 
## Model 1: mod$iq ~ mod$wm + mod$d1 + mod$d2
## Model 2: mod$iq ~ mod$wm + mod$d1 + mod$d2 + wm_d1 + wm_d2
##   Res.Df   RSS Df Sum of Sq      F  Pr(>F)  
## 1    146 31655                              
## 2    144 29784  2    1871.3 4.5238 0.01243 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# It is also interesting to calculate some correlations, because you can get mathematical or computational problems when two predictor variables are highly correlated

# Recall that when two predictor variables in a glm are so highly correlated that they are essentially redundant, it can become difficult to estimate the values associated with each predictor
# So in the case of multicolinearity you can run into problems

# The question is now: can centering avoid this problem? To answer this question, you first have to calculate some correlations

# Calculate the correlations between working memory capacity and the product terms
cor_wmd1 <- cor(mod$wm, wm_d1)
cor_wmd2 <- cor(mod$wm, wm_d2)
cor_wmd1_centered <- cor(mod$wm.centered, wm_d1_centered)
cor_wmd2_centered <- cor(mod$wm.centered, wm_d2_centered)

# Calculate the correlations between the dummy variables and the product terms
cor_d1d1<- cor(mod$d1, wm_d1)
cor_d2d2 <- cor(mod$d2, wm_d2)
cor_d1d1_centered <- cor(mod$d1, wm_d1_centered)
cor_d2d2_centered <- cor(mod$d2, wm_d2_centered)

# correlations
rbind(c(cor_wmd1, cor_wmd2), c(cor_wmd1_centered, cor_wmd2_centered))
##           [,1]        [,2]
## [1,] 0.1696231 -0.04339472
## [2,] 0.5298278  0.61198031
rbind(c(cor_d1d1, cor_d2d2), c(cor_d1d1_centered, cor_d2d2_centered))
##            [,1]       [,2]
## [1,] 0.98007096  0.9720607
## [2,] 0.08388962 -0.2071665

Chapter 3 - Introduction to Mediation

Introduction to Mediation:

  • Moderator variables have influence over relationships
  • Mediator variables account for and/or explain relationships between X and Y
  • Continuing with the previous examples where X is a stereotype threat, Y is a test score, and Z is working memory, now treated as a mediator variable
    • Basically, the proposed mechanism is the Threat influences Working Memory which drives lower Test Scores

Basic Principles of Mediation:

  • A mediation analysis is typically conducted to better understand an observed effect of an independent variable on a dependent variable (or correlation between X and Y)
    • Why and how does A impact B?
    • Mediation is a tool to start getting more information from correlations, possibly leading to somewhat stronger claims about causality
  • The basic principles would be as follows, supposing that Y is the dependent variable, X is the predictor variable, and M is the mediator variable
    • X should then be a statistically significant predictor of M
    • M should then be a statistically significant predictor of Y
  • When X and Y are correlated BECAUSE of mediator M, then
    • Y ~ X may have a significant coefficient for X
    • Y ~ X + M may no longer have a significant coefficient (or may be less significant) for X

Partial and Full Mediation:

  • Mediator variables account for some or all of the relationship between X and Y
  • Full mediation is when the mediator variable explains all of the variance between X and Y (coefficient for X drops to zero)
  • Partial mediation is when the mediator variable explains some of the variance between X and Y (but the coefficient for X does not drop all the way to zero)
  • Caution that correlation does not imply causation
    • There is a BIG difference between statistical mediation and true causal mediation

Example code includes:

# The example used in this chapter and described in the video, is the idea of stereotype threat
# The independent variable is the experimental manipulation, so the stereotype threat, and the dependent variable is the IQ test score iq
# The variable working memory capacity wm is the mediator
# You want to investigate how stereotype threat affects the IQ test scores and why these two variables influence each other

# The experiment is conducted in the following way:
# students completed a working memory test
# students are randomly assigned to one of two experimental conditions, namely threat or no threat. Each group consists of 50 students.
# students completed an IQ test

# It is always important to look at the data before you do an analysis
# The data is loaded into your workspace under the name med
# Take a look at this data by typing head(med) in the console

med <- data.frame(stringsAsFactors = FALSE, 
                  subject=1:100, 
                  condition=factor(c(rep("control", 50), rep("threat", 50))), 
                  iq=c(73, 128, 83, 83, 64, 95, 94, 89, 98, 108, 86, 77, 86, 100, 97, 87, 121, 106, 109, 95, 104, 136, 118, 95, 82, 95, 96, 102, 103, 107, 112, 137, 90, 81, 93, 91, 105, 91, 114, 82, 102, 79, 110, 71, 94, 108, 82, 103, 113, 91, 84, 83, 92, 117, 68, 97, 94, 64, 81, 90, 92, 101, 99, 88, 98, 99, 116, 51, 101, 103, 77, 89, 78, 82, 83, 75, 65, 79, 93, 82, 85, 65, 82, 104, 85, 99, 82, 103, 76, 70, 91, 69, 99, 87, 102, 80, 92, 74, 81, 69), 
                  wm=c(37, 77, 32, 33, 53, 46, 72, 61, 58, 53, 42, 42, 33, 57, 44, 52, 70, 74, 58, 49, 54, 81, 65, 61, 50, 57, 55, 52, 59, 52, 54, 71, 62, 30, 45, 60, 61, 73, 61, 45, 42, 48, 77, 32, 67, 61, 23, 72, 76, 57, 50, 39, 38, 63, 36, 54, 45, 36, 55, 46, 38, 74, 48, 25, 54, 66, 50, 15, 44, 59, 43, 43, 31, 43, 53, 39, 26, 44, 36, 45, 42, 36, 35, 43, 65, 43, 28, 57, 29, 20, 58, 24, 59, 38, 56, 36, 59, 26, 55, 28)
                  )
str(med)
## 'data.frame':    100 obs. of  4 variables:
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ condition: Factor w/ 2 levels "control","threat": 1 1 1 1 1 1 1 1 1 1 ...
##  $ iq       : num  73 128 83 83 64 95 94 89 98 108 ...
##  $ wm       : num  37 77 32 33 53 46 72 61 58 53 ...
# Summary statistics
psych::describeBy(med, med$condition)
## 
##  Descriptive statistics by group 
## group: control
##            vars  n  mean    sd median trimmed   mad min max range  skew
## subject       1 50 25.50 14.58   25.5   25.50 18.53   1  50    49  0.00
## condition*    2 50  1.00  0.00    1.0    1.00  0.00   1   1     0   NaN
## iq            3 50 97.32 15.55   95.0   96.55 15.57  64 137    73  0.43
## wm            4 50 54.92 13.87   56.0   55.27 14.08  23  81    58 -0.20
##            kurtosis   se
## subject       -1.27 2.06
## condition*      NaN 0.00
## iq             0.11 2.20
## wm            -0.64 1.96
## -------------------------------------------------------- 
## group: threat
##            vars  n  mean    sd median trimmed   mad min max range  skew
## subject       1 50 75.50 14.58   75.5   75.50 18.53  51 100    49  0.00
## condition*    2 50  2.00  0.00    2.0    2.00  0.00   2   2     0   NaN
## iq            3 50 86.32 13.67   85.0   86.50 13.34  51 117    66 -0.08
## wm            4 50 43.50 12.99   43.0   43.45 13.34  15  74    59  0.06
##            kurtosis   se
## subject       -1.27 2.06
## condition*      NaN 0.00
## iq            -0.22 1.93
## wm            -0.58 1.84
# Create a boxplot of the data
boxplot(med$iq ~ med$condition, main = "Boxplot", xlab = "Group condition", ylab = "IQ")

# Run the three regression models
model_yx <- lm(med$iq ~ med$condition)
model_mx <- lm(med$wm ~ med$condition)
model_yxm <- lm(med$iq ~ med$condition + med$wm)

# Make a summary of the three models
summary(model_yx)
## 
## Call:
## lm(formula = med$iq ~ med$condition)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -35.32  -9.57  -1.82  10.68  39.68 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           97.320      2.071  46.999  < 2e-16 ***
## med$conditionthreat  -11.000      2.928  -3.756 0.000293 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.64 on 98 degrees of freedom
## Multiple R-squared:  0.1259, Adjusted R-squared:  0.1169 
## F-statistic: 14.11 on 1 and 98 DF,  p-value: 0.0002928
summary(model_mx)
## 
## Call:
## lm(formula = med$wm ~ med$condition)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31.92  -7.75  -0.50  10.19  30.50 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           54.920      1.901  28.895  < 2e-16 ***
## med$conditionthreat  -11.420      2.688  -4.249 4.91e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.44 on 98 degrees of freedom
## Multiple R-squared:  0.1555, Adjusted R-squared:  0.1469 
## F-statistic: 18.05 on 1 and 98 DF,  p-value: 4.906e-05
summary(model_yxm)
## 
## Call:
## lm(formula = med$iq ~ med$condition + med$wm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.875  -7.897   0.932   6.993  27.581 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          55.9977     4.6440  12.058  < 2e-16 ***
## med$conditionthreat  -2.4075     2.3164  -1.039    0.301    
## med$wm                0.7524     0.0800   9.406 2.58e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.64 on 97 degrees of freedom
## Multiple R-squared:  0.5428, Adjusted R-squared:  0.5334 
## F-statistic: 57.59 on 2 and 97 DF,  p-value: < 2.2e-16
# The multilevel R package contains a function sobel()
# The sobel()function runs the whole mediation analysis which is very convenient

# Compare the previous results to the output of the sobel function
model_all <- multilevel::sobel(pred=med$condition, med=med$wm, out=med$iq)

# Print out model_all
model_all
## $`Mod1: Y~X`
##             Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)    97.32   2.070678 46.999106 4.965625e-69
## predthreat    -11.00   2.928380 -3.756342 2.927777e-04
## 
## $`Mod2: Y~X+M`
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 55.997696 4.64403367 12.057987 5.303590e-21
## predthreat  -2.407489 2.31641713 -1.039316 3.012416e-01
## med          0.752409 0.07999527  9.405669 2.576515e-15
## 
## $`Mod3: M~X`
##             Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)    54.92   1.900708 28.894501 9.487423e-50
## predthreat    -11.42   2.688007 -4.248501 4.906391e-05
## 
## $Indirect.Effect
## [1] -8.592511
## 
## $SE
## [1] 2.219233
## 
## $z.value
## [1] -3.871839
## 
## $N
## [1] 100

Introduction to Statistics with R: Multiple Regression

Chapter 1 - Introduction to Multiple Regression

Principles of Multiple Regression:

  • Multiple regression is simple regression but with more predictors
    • Typically, k is defined as the number of predictor variables
  • Still have the multiple correlation coefficient - r - which can be squared to R^2 to describe the overall variance explained by the model

Multiple regression interpretation:

  • Example of predicting faculty salaries - time since PhD, publications, gender, etc.
    • The gender variables has been included as a nominal (factor) variable with Male=0, Female=1
    • The intercept can be interpreted as the amount of salary for someone with 0 on all the predictors
    • Each regression coefficient can be interpreted as the slope, all else equal, for people at the average on all the other predictors
  • Looking at the standardized coefficients is the best way to assess the “strongest predictor” of the dependent variable

Example code includes:

# This chapter will help to develop your understanding of the basic framework of multiple regressions
# To this purpose, you will use a pre-loaded dataset fs that contains the yearly wages of professors at a certain university
# By going through the exercises, you will familiarize yourself with the implementation of, and the intuition behind, the most important aspects of multiple regression in R

# As always, start by exploring the data in order to make sure that you get a global view of the data that you are working with
# This first step is vital as it helps you to truly understand things
# As such, you will be able to move through the exercises more easily
# Use the R console to perform these actions

# fs is available in your working environment
fs <- data.frame(stringsAsFactors = FALSE, 
                 salary=c(60072, 61017, 61618, 61976, 66398, 67083, 69314, 71653, 72519, 74821, 79725, 80882, 83525, 84023, 89254, 89387, 91309, 91413, 94553, 95561, 96024, 98398, 98822, 101076, 101330, 101981, 104699, 106374, 108571, 109671, 110178, 110936, 111600, 112355, 113904, 115150, 116494, 116790, 117913, 118314, 119329, 121024, 121878, 122277, 124491, 126868, 128285, 131222, 132000, 132505, 133593, 134198, 135777, 135837, 138672, 138882, 141535, 142191, 143980, 154545, 155845, 155936, 156294, 157228, 157268, 158139, 158199, 159075, 160924, 163279, 164402, 165398, 165695, 168116, 170020, 171435, 172515, 172576, 173199, 175335, 176230, 177797, 177813, 179080, 179944, 180942, 181368, 181427, 182206, 184407, 185797, 185800, 186096, 187338, 189447, 189507, 190120, 191868, 197288, 199606), 
                 age=c(38, 39, 38, 31, 32, 39, 32, 31, 66, 66, 66, 66, 66, 32, 34, 32, 35, 38, 32, 36, 37, 55, 55, 55, 55, 48, 43, 43, 48, 42, 49, 49, 46, 47, 47, 45, 48, 46, 43, 49, 44, 40, 47, 45, 44, 45, 41, 55, 55, 45, 45, 52, 58, 59, 53, 57, 59, 58, 54, 59, 53, 59, 57, 57, 55, 55, 58, 56, 51, 44, 44, 44, 44, 44, 44, 44, 55, 61, 66, 64, 62, 60, 63, 66, 34, 66, 44, 66, 60, 64, 63, 44, 67, 60, 63, 54, 44, 61, 61, 62), 
                 years=c(16, 14, 18, 15, 30, 25, 24, 29, 32, 26, 29, 30, 31, 5, 8, 6, 7, 12, 10, 11, 5, 6, 13, 6, 5, 19, 17, 15, 21, 16, 21, 23, 17, 20, 19, 19, 22, 18, 17, 22, 17, 13, 18, 18, 18, 16, 14, 18, 15, 30, 25, 24, 29, 32, 26, 29, 30, 31, 28, 31, 24, 33, 31, 29, 29, 28, 29, 27, 22, 22, 22, 22, 22, 22, 22, 22, 32, 37, 36, 40, 40, 32, 34, 39, 39, 32, 22, 40, 32, 37, 36, 22, 40, 32, 34, 39, 39, 32, 41, 40), 
                 pubs=c(22, 23, 23, 14, 32, 27, 29, 37, 61, 69, 58, 53, 46, 67, 25, 39, 35, 37, 17, 23, 41, 21, 17, 34, 32, 66, 66, 66, 66, 25, 67, 70, 55, 62, 61, 60, 67, 59, 52, 69, 56, 42, 56, 58, 57, 52, 47, 57, 50, 91, 80, 77, 90, 66, 83, 92, 94, 98, 87, 96, 75, 66, 98, 89, 90, 88, 92, 86, 69, 77, 80, 67, 67, 66, 66, 44, 122, 125, 122, 44, 112, 110, 122, 44, 98, 104, 44, 44, 44, 122, 44, 44, 110, 122, 122, 98, 104, 101, 125, 124), 
                 dept=factor(c('P', 'P', 'P', 'P', 'P', 'P', 'P', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'P', 'P', 'P', 'P', 'P', 'P', 'P', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'P', 'P', 'P', 'P', 'P', 'P', 'P', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'P', 'P', 'P', 'P', 'P', 'P', 'P', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'P', 'P', 'P', 'P', 'P', 'P', 'P', 'S', 'S', 'S', 'S', 'S'))
                 )
str(fs)
## 'data.frame':    100 obs. of  5 variables:
##  $ salary: num  60072 61017 61618 61976 66398 ...
##  $ age   : num  38 39 38 31 32 39 32 31 66 66 ...
##  $ years : num  16 14 18 15 30 25 24 29 32 26 ...
##  $ pubs  : num  22 23 23 14 32 27 29 37 61 69 ...
##  $ dept  : Factor w/ 3 levels "H","P","S": 2 2 2 2 2 2 2 3 3 3 ...
# Perform the two single regressions and save them in a variable
model_years <- lm(salary ~ years, data=fs)
model_pubs <- lm(salary ~ pubs, data=fs)

# Plot both enhanced scatter plots in one plot matrix of 1 by 2
par(mfrow = c(1, 2))
plot(fs$salary ~ fs$years, main = "plot_years", xlab = "years", ylab = "salary")
abline(model_years)
plot(fs$salary ~ fs$pubs, main = "plot_pubs", xlab = "pubs", ylab = "salary")
abline(model_pubs)

par(mfrow = c(1, 1))


# Do a single regression of salary onto years of experience and check the output
model_1 <- lm(fs$salary ~ fs$years)
summary(model_1)
## 
## Call:
## lm(formula = fs$salary ~ fs$years)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -82972  -9537   4305  17703  57949 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  68672.5     8259.0   8.315 5.38e-13 ***
## fs$years      2689.9      318.4   8.448 2.78e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30220 on 98 degrees of freedom
## Multiple R-squared:  0.4214, Adjusted R-squared:  0.4155 
## F-statistic: 71.37 on 1 and 98 DF,  p-value: 2.779e-13
# Do a multiple regression of salary onto years of experience and numbers of publications and check the output
model_2 <- lm(fs$salary ~ fs$years + fs$pubs)
summary(model_2)
## 
## Call:
## lm(formula = fs$salary ~ fs$years + fs$pubs)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67835 -14589   2362  13358  69613 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  58828.7     7605.9   7.735 9.79e-12 ***
## fs$years      1337.4      387.1   3.455 0.000819 ***
## fs$pubs        634.9      123.6   5.137 1.44e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26930 on 97 degrees of freedom
## Multiple R-squared:  0.5451, Adjusted R-squared:  0.5357 
## F-statistic: 58.12 on 2 and 97 DF,  p-value: < 2.2e-16
# Save the R squared of both models in preliminary variables
preliminary_model_1 <- summary(model_1)$r.squared
preliminary_model_2 <- summary(model_2)$r.squared

# Round them off while you save them in new variables
r_squared <- c()
r_squared[1] <- round(preliminary_model_1, 3)
r_squared[2] <- round(preliminary_model_2, 3)

# Print out the vector to see both R squared coefficients
r_squared
## [1] 0.421 0.545
# Do multiple regression and check the regression output
model_3 <- lm(fs$salary ~ fs$years + fs$pubs + fs$age)
summary(model_3)
## 
## Call:
## lm(formula = fs$salary ~ fs$years + fs$pubs + fs$age)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -70488 -14268   2502  13233  70413 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  51538.4    13726.9   3.755 0.000298 ***
## fs$years      1210.0      436.5   2.772 0.006691 ** 
## fs$pubs        618.9      126.5   4.894 3.98e-06 ***
## fs$age         227.2      355.6   0.639 0.524444    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27010 on 96 degrees of freedom
## Multiple R-squared:  0.5471, Adjusted R-squared:  0.5329 
## F-statistic: 38.65 on 3 and 96 DF,  p-value: < 2.2e-16
# Round off the R squared coefficients and save the result in the vector (in one step!)
r_squared[3] <- round(summary(model_3)$r.squared, 3)

# Print out the vector in order to display all R squared coefficients simultaneously
r_squared
## [1] 0.421 0.545 0.547

Chapter 2 - Intuition Behind Estimation of Multiple Regression Coefficients

Correlation matrix of raw dataframe:

  • Begin with a raw data frame (if all of the columns are of the same class, it can easily become a data matrix) that is mxn
  • Can get the column sums by using 1xm matrix of all 1s %*% the desired matrix
    • Can then multiply each element of the resulting vector by 1/m to get the column means
  • Can get a full column of means by multiplying an nx1 vector of 1s with the column means data
    • Will then have a same-sized matrix of the column means
    • Can take the original matrix minus the column means matrix
  • Call the final matrix of differences from column means D
    • t(D) %*% D will then give the sum-squared of each column (delta from means) with each other column including itself
    • And then t(D) %% D (1/m) will then give the variance covariance matrix
  • Can then calculate the standard deviation matrix, S = diag(D) ** 0.5
  • And, then the correlation matrix is solve(S) %% D %% solve(S)

Estimation of regression coefficients:

  • Coefficients are estimated to minimize the sum-squared of the residuals
  • Suppose that the regression has been run on centered data (no intercept - Y=0 when all X=0)
    • Y-hat = X %*% B where B is the kx1 of predictors and X is the Nxk of data
    • The goal, then, is to solve for B such that sum(Y-hat**2) is minimized
  • Can then work with the inverses to make the matrix square and symmetric
    • t(X) %% Y = t(X) %% X %*% B
    • solve(t(X) %% X) %% (t(X) %*% Y) = B

Example code includes:

# Construction of 3 by 8 matrix r that contains the numbers 1 up to 24
r <- matrix(1:24, ncol=8)

# Construction of 3 by 8 matrix s that contains the numbers 21 up to 44
s <- matrix(21:44, ncol=8)

# Take the transpose t of matrix r
t <- t(r)


# Compute the sum of matrices r and s
operation_1 <- r + s

# Compute the difference between matrices r and s
operation_2 <- r - s

# Multiply matrices t and s
operation_3 <- t %*% s


X <- matrix(ncol=3, nrow=10, byrow=FALSE, 
            data=c(3, 3, 2, 4, 4, 5, 2, 3, 5, 3, 2, 2, 4, 3, 4, 4, 5, 3, 3, 5, 3, 3, 4, 4, 3, 3, 4, 2, 4, 4)
            )
str(X)
##  num [1:10, 1:3] 3 3 2 4 4 5 2 3 5 3 ...
# The raw dataframe X is already loaded in.
X
##       [,1] [,2] [,3]
##  [1,]    3    2    3
##  [2,]    3    2    3
##  [3,]    2    4    4
##  [4,]    4    3    4
##  [5,]    4    4    3
##  [6,]    5    4    3
##  [7,]    2    5    4
##  [8,]    3    3    2
##  [9,]    5    3    4
## [10,]    3    5    4
# Construction of 1 by 10 matrix I of which the elements are all 1
I <- matrix(data=1L, nrow=1, ncol=10)

# Compute the row vector of sums
t_mat <- I %*% X


# The data matrix `X` and the row vector of sums (`t_mat`) are saved and can be used.

# Number of observations
n = 10

# Compute the row vector of means
M <- t_mat * (1/n)

# Construction of 10 by 1 matrix J of which the elements are all 1
J <- matrix(data=1L, nrow=10, ncol=1)

# Compute the matrix of means
MM <- J %*% M


# Matrix of deviation scores D
D <- X - MM


# Sum of squares and sum of cross products matrix
S <- t(D) %*% D

# The previously generated matrices X, M, MM, D and S do not need to be constructed again but are saved and can be used.
n = 10

# Construct the variance-covariance matrix
C <- S * (1/n)

# Generate the standard deviations matrix
SD <- diag(x = diag(C)^(1/2), nrow = 3, ncol = 3)

# Compute the correlation matrix
R <- solve(SD) %*% C %*% solve(SD)

Chapter 3 - Dummy Coding

Introduction to dummy coding:

  • System for coding categorical variables in a regression analysis
  • Example of having sub-discipline as a categorical variable in the professor salary regression analysis
  • Dummy coding is the process of changing a categorical variables to a series of columns of 1/0
    • A reference category is set (can be any; choose for easier interpretation) and does not have a column
    • A column is then created for all the other possible levels of the factor, and a 1 is set if the observation is in that level

Dummy variables in multiple regression:

  • Can include the dummy code in the regression as per usual, with the intercept already incorporating the reference level

Introduction to effects coding:

  • Effects coding can be either unweighted or weighted
  • For unweighted effects coding, set the reference level to be -1 on all the dummy variables rather than 0 on all the dummy variables
    • The intercept then does not represent any specific group, but will be close to the mean for the entire group (discrepancy is due to different group sizes)
  • For weighted effects coding, use number of samples in the reference group as the denominator everywhere
    • For dummy C1, reference group is -nC1 / nRef while C1 group as nC1 / nRef with the others being 0
    • Same for dummy C2, C3, C4, … , etc.

Example code includes:

# Again, you will use the dataset fs
# Besides the independent variable, yearly wages (salary), and other characteristics of professors at a certain university, this dataset also contains a categorical variable (dept), that holds the information on the department that each professor belongs to
# There are three departments: history (h), psychology (p) and sociology (s)

# Summary statistics
psych::describeBy(fs, fs$dept)
## 
##  Descriptive statistics by group 
## group: H
##        vars  n      mean       sd median   trimmed      mad   min    max
## salary    1 28 137421.29 33736.24 138411 137679.38 45224.49 89387 181427
## age       2 28     49.04    11.00     48     49.04    14.08    32     66
## years     3 28     22.25    10.95     22     22.21    14.08     5     40
## pubs      4 28     63.32    29.13     57     62.62    28.17    17    122
## dept*     5 28      1.00     0.00      1      1.00     0.00     1      1
##        range  skew kurtosis      se
## salary 92040 -0.04    -1.61 6375.55
## age       34 -0.01    -1.37    2.08
## years     35 -0.03    -1.34    2.07
## pubs     105  0.29    -1.14    5.50
## dept*      0   NaN      NaN    0.00
## -------------------------------------------------------- 
## group: P
##        vars  n      mean       sd median   trimmed      mad   min    max
## salary    1 35 129067.37 43738.84 131222 130061.03 48036.24 60072 189447
## age       2 35     48.23     9.63     45     48.24    10.38    31     67
## years     3 35     22.09     8.35     22     21.93     8.90     5     40
## pubs      4 35     59.91    30.52     57     57.86    34.10    14    122
## dept*     5 35      2.00     0.00      2      2.00     0.00     2      2
##         range  skew kurtosis      se
## salary 129375 -0.18    -1.33 7393.21
## age        36  0.15    -0.97    1.63
## years      35  0.20    -0.53    1.41
## pubs      108  0.48    -0.65    5.16
## dept*       0   NaN      NaN    0.00
## -------------------------------------------------------- 
## group: S
##        vars  n      mean       sd median   trimmed      mad   min    max
## salary    1 37 135015.59 40034.84 135837 135091.03 52777.59 71653 199606
## age       2 37     53.30    10.01     54     54.10    11.86    31     66
## years     3 37     27.51     8.72     29     28.00    10.38     5     41
## pubs      4 37     76.30    28.11     67     76.19    31.13    25    125
## dept*     5 37      3.00     0.00      3      3.00     0.00     3      3
##         range  skew kurtosis      se
## salary 127953 -0.02    -1.32 6581.69
## age        35 -0.47    -0.72    1.64
## years      36 -0.45    -0.16    1.43
## pubs      100  0.21    -0.90    4.62
## dept*       0   NaN      NaN    0.00
# In order to automatically create dummy variables, the dummy.code() function of the psych package is easy to use.

# The function takes a categorical variable as argument and automatically creates the required dummy variables: all levels are ranked alphabetically and the first one is taken as the reference group
# Remember that only (N-1) dummies are created for a categorical variable with N levels
# Consequently, the category which is not directly linked with a dummy variable is defined as the reference category.

# fs is available in your working environment

# Create the dummy variables
dept_code <- psych::dummy.code(fs$dept)
dept_code
##        H P S
##   [1,] 0 1 0
##   [2,] 0 1 0
##   [3,] 0 1 0
##   [4,] 0 1 0
##   [5,] 0 1 0
##   [6,] 0 1 0
##   [7,] 0 1 0
##   [8,] 0 0 1
##   [9,] 0 0 1
##  [10,] 0 0 1
##  [11,] 0 0 1
##  [12,] 0 0 1
##  [13,] 0 0 1
##  [14,] 0 0 1
##  [15,] 0 0 1
##  [16,] 1 0 0
##  [17,] 1 0 0
##  [18,] 1 0 0
##  [19,] 1 0 0
##  [20,] 1 0 0
##  [21,] 1 0 0
##  [22,] 1 0 0
##  [23,] 0 1 0
##  [24,] 0 1 0
##  [25,] 0 1 0
##  [26,] 0 1 0
##  [27,] 0 1 0
##  [28,] 0 1 0
##  [29,] 0 1 0
##  [30,] 0 0 1
##  [31,] 0 0 1
##  [32,] 0 0 1
##  [33,] 0 0 1
##  [34,] 0 0 1
##  [35,] 0 0 1
##  [36,] 0 0 1
##  [37,] 0 0 1
##  [38,] 1 0 0
##  [39,] 1 0 0
##  [40,] 1 0 0
##  [41,] 1 0 0
##  [42,] 1 0 0
##  [43,] 1 0 0
##  [44,] 1 0 0
##  [45,] 0 1 0
##  [46,] 0 1 0
##  [47,] 0 1 0
##  [48,] 0 1 0
##  [49,] 0 1 0
##  [50,] 0 1 0
##  [51,] 0 1 0
##  [52,] 0 0 1
##  [53,] 0 0 1
##  [54,] 0 0 1
##  [55,] 0 0 1
##  [56,] 0 0 1
##  [57,] 0 0 1
##  [58,] 0 0 1
##  [59,] 0 0 1
##  [60,] 1 0 0
##  [61,] 1 0 0
##  [62,] 1 0 0
##  [63,] 1 0 0
##  [64,] 1 0 0
##  [65,] 1 0 0
##  [66,] 1 0 0
##  [67,] 0 1 0
##  [68,] 0 1 0
##  [69,] 0 1 0
##  [70,] 0 1 0
##  [71,] 0 1 0
##  [72,] 0 1 0
##  [73,] 0 1 0
##  [74,] 0 0 1
##  [75,] 0 0 1
##  [76,] 0 0 1
##  [77,] 0 0 1
##  [78,] 0 0 1
##  [79,] 0 0 1
##  [80,] 0 0 1
##  [81,] 0 0 1
##  [82,] 1 0 0
##  [83,] 1 0 0
##  [84,] 1 0 0
##  [85,] 1 0 0
##  [86,] 1 0 0
##  [87,] 1 0 0
##  [88,] 1 0 0
##  [89,] 0 1 0
##  [90,] 0 1 0
##  [91,] 0 1 0
##  [92,] 0 1 0
##  [93,] 0 1 0
##  [94,] 0 1 0
##  [95,] 0 1 0
##  [96,] 0 0 1
##  [97,] 0 0 1
##  [98,] 0 0 1
##  [99,] 0 0 1
## [100,] 0 0 1
# Merge the dataset in an extended dataframe
extended_fs <- cbind(fs, dept_code)

# Look at the extended dataframe
head(extended_fs)
##   salary age years pubs dept H P S
## 1  60072  38    16   22    P 0 1 0
## 2  61017  39    14   23    P 0 1 0
## 3  61618  38    18   23    P 0 1 0
## 4  61976  31    15   14    P 0 1 0
## 5  66398  32    30   32    P 0 1 0
## 6  67083  39    25   27    P 0 1 0
# Provide summary statistics
summary(extended_fs)
##      salary            age            years            pubs        dept  
##  Min.   : 60072   Min.   :31.00   Min.   : 5.00   Min.   : 14.00   H:28  
##  1st Qu.:101818   1st Qu.:44.00   1st Qu.:17.75   1st Qu.: 44.00   P:35  
##  Median :133049   Median :49.00   Median :23.50   Median : 66.00   S:37  
##  Mean   :133607   Mean   :50.33   Mean   :24.14   Mean   : 66.93         
##  3rd Qu.:170374   3rd Qu.:59.00   3rd Qu.:31.25   3rd Qu.: 90.00         
##  Max.   :199606   Max.   :67.00   Max.   :41.00   Max.   :125.00         
##        H              P              S       
##  Min.   :0.00   Min.   :0.00   Min.   :0.00  
##  1st Qu.:0.00   1st Qu.:0.00   1st Qu.:0.00  
##  Median :0.00   Median :0.00   Median :0.00  
##  Mean   :0.28   Mean   :0.35   Mean   :0.37  
##  3rd Qu.:1.00   3rd Qu.:1.00   3rd Qu.:1.00  
##  Max.   :1.00   Max.   :1.00   Max.   :1.00
# In order to include a categorical variable in a regression, the variable needs to be converted into a numeric variable by the means of a dummy variable
# Previously, dummy variables have been generated using the intuitive, but less general dummy.code() function from the psych library

# From this point onwards the contrast C() function is used to create dummy variables
# Do not confuse this function with the c() function that is used to combine values in a vector or list
# The contrast C() takes a categorical variable as a first argument and the treatment as a second argument
# The latter tells R to rank all levels alphabetically and to take the first category as the reference group

# This exercise will illustrate the inclusion of the categorical variable dept in a multiple regression
# The code on the right estimates the regression without categorical variable
# The summary() function is used to get the summary of the regression results of model and the confint() function is used to create the confidence intervals

# Regress salary against years and publications
model <- lm(fs$salary ~ fs$years + fs$pubs)

# Apply the summary function to get summarized results for model
summary(model)
## 
## Call:
## lm(formula = fs$salary ~ fs$years + fs$pubs)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67835 -14589   2362  13358  69613 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  58828.7     7605.9   7.735 9.79e-12 ***
## fs$years      1337.4      387.1   3.455 0.000819 ***
## fs$pubs        634.9      123.6   5.137 1.44e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26930 on 97 degrees of freedom
## Multiple R-squared:  0.5451, Adjusted R-squared:  0.5357 
## F-statistic: 58.12 on 2 and 97 DF,  p-value: < 2.2e-16
# Compute the confidence intervals for model
confint(model)
##                  2.5 %     97.5 %
## (Intercept) 43733.1268 73924.1805
## fs$years      569.0514  2105.6692
## fs$pubs       389.5972   880.2303
# Create dummies for the categorical variable fs$dept by using the C() function
dept_code <- C(fs$dept, treatment)

# Regress salary against years, publications and department
model_dummy <- lm(fs$salary ~ fs$years + fs$pubs + dept_code)

# Apply the summary function to get summarized results for model_dummy
summary(model_dummy)
## 
## Call:
## lm(formula = fs$salary ~ fs$years + fs$pubs + dept_code)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59233 -12490  -1488  13227  67294 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  62365.4     8299.0   7.515 3.13e-11 ***
## fs$years      1507.8      379.4   3.974 0.000138 ***
## fs$pubs        655.5      120.2   5.452 3.93e-07 ***
## dept_codeP   -5872.8     6621.8  -0.887 0.377381    
## dept_codeS  -18847.7     6706.6  -2.810 0.006011 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26080 on 95 degrees of freedom
## Multiple R-squared:  0.5824, Adjusted R-squared:  0.5648 
## F-statistic: 33.12 on 4 and 95 DF,  p-value: < 2.2e-16
# Compute the confidence intervals for model_dummy
confint(model_dummy)
##                   2.5 %     97.5 %
## (Intercept)  45889.8369 78841.0109
## fs$years       754.5720  2260.9439
## fs$pubs        416.8113   894.2219
## dept_codeP  -19018.7013  7273.1563
## dept_codeS  -32162.0662 -5533.3194
anova(model, model_dummy)
## Analysis of Variance Table
## 
## Model 1: fs$salary ~ fs$years + fs$pubs
## Model 2: fs$salary ~ fs$years + fs$pubs + dept_code
##   Res.Df        RSS Df  Sum of Sq     F  Pr(>F)  
## 1     97 7.0358e+10                              
## 2     95 6.4594e+10  2 5764553224 4.239 0.01724 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# To see what role that the department plays in explaining the professors' salary, you can take a look at the actual differences in mean salary among departments.
# In order to compute the actual means of the salaries for each department easily, use the tapply() function, in which you enter the variable, the categorical variable and the requested summary statistic (that is, the mean).

tapply(fs$salary, fs$dept, FUN=mean)
##        H        P        S 
## 137421.3 129067.4 135015.6
# Number of levels
fs$dept
##   [1] P P P P P P P S S S S S S S S H H H H H H H P P P P P P P S S S S S S
##  [36] S S H H H H H H H P P P P P P P S S S S S S S S H H H H H H H P P P P
##  [71] P P P S S S S S S S S H H H H H H H P P P P P P P S S S S S
## Levels: H P S
# Factorize the categorical variable fs$dept and name the factorized variable dept.f
dept.f <- factor(fs$dept)

# Assign the 3 levels generated in step 2 to dept.f
contrasts(dept.f) <-contr.sum(3)

# Regress salary against dept.f
model_unweighted <- lm(fs$salary ~ dept.f)

# Apply the summary() function
summary(model_unweighted)
## 
## Call:
## lm(formula = fs$salary ~ dept.f)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -68995 -27249   1488  36471  64590 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   133835       4007  33.403   <2e-16 ***
## dept.f1         3586       5907   0.607    0.545    
## dept.f2        -4767       5579  -0.855    0.395    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39780 on 97 degrees of freedom
## Multiple R-squared:  0.007771,   Adjusted R-squared:  -0.01269 
## F-statistic: 0.3799 on 2 and 97 DF,  p-value: 0.685
# Factorize the categorical variable fs$dept and name the factorized variable dept.g
dept.g <- factor(fs$dept)

weights <- matrix(data=c(-1.25, 1.25, 0, -1.32, 0, 1.32), ncol=2, nrow=3, byrow=FALSE)
weights
##       [,1]  [,2]
## [1,] -1.25 -1.32
## [2,]  1.25  0.00
## [3,]  0.00  1.32
# Assign the weights matrix to dept.g
contrasts(dept.g) <- weights

# Regress salary against dept.f and apply the summary() function
model_weighted <- lm(fs$salary ~ dept.g)

# Apply the summary() function
summary(model_weighted)
## 
## Call:
## lm(formula = fs$salary ~ dept.g)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -68995 -27249   1488  36471  64590 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 133834.8     4006.6  33.403   <2e-16 ***
## dept.g1      -3813.9     4462.9  -0.855    0.395    
## dept.g2        894.6     4170.6   0.214    0.831    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39780 on 97 degrees of freedom
## Multiple R-squared:  0.007771,   Adjusted R-squared:  -0.01269 
## F-statistic: 0.3799 on 2 and 97 DF,  p-value: 0.685

Cluster Analysis in R

Chapter 1 - Calculating Distance Between Objects

What is cluster analysis:

  • Data exploration technique to help make sense of the data
  • The objective of clustering is to build groups where members of the group tend to be similar to each other and different from members of other groups
    • Clustering is a form of EDA where observations are split such that they have similar features
  • Typical clustering process flow includes
    • Pre-process data (no missing values, common scales for data)
    • Select similarity measure (hierarchical or k-means)
    • Cluster
    • Analyze clusters

Distance between two observations:

  • Objective is to get a measure of the dissimilarity between the observations - minimize the dissimilarity within segments while maximizing across segments
  • Euclidean distance is sometimes used, especially if the data have already been scaled
    • The dist(myDF, method=“euclidean”) function in R will calculate all the Euclidean distances

Importance of scale:

  • Features need to be of comparable scale for the Euclidean distances to make sense (e.g., feet in one column with inches in another would likely make little sense in segmentation)
  • Generally, should change the features to have roughly the same variance
    • One common method is standadrization, or converting every observation to (x - mu) / sd where mu, sd are the column means, standard deviations
    • The scale() function in R will standardize each of the feature columns to N(0, 1)

Measuring distance for categorical data:

  • Binary data is the simplest type of categorical for Euclidean-distance clustering
    • The Jaccard index is the intersection (number of mutual TRUE) divided by the union (number of collective TRUE)
    • So, if there are four binary features and two people have one answer of TRUE in common, their Jaccard score is 0.25
    • The distance score can then be said to be 1 minus the Jaccard distance
    • Can run this in R using dist(myDF, method=“binary”)
  • Categorical data with 3+ levels is more complex to score for distance
    • Start by taking each feature-value pair, making it a column, and scoring as 1 if the observations has that feature-value and 0 otherwise (basically, dummify the data)
    • Can then calculate the Jaccard distance for the dummified frame
    • The dummies::dummy.data.frame(myDF) can run this automatically provided that the categorical variables are all encoded as factors
    • Can run this in R using dist(dummies::dummy.data.frame(myDF, method=“binary”))

Example code includes:

playersData <- readRDS("./RInputFiles/lineup.rds")

two_players <- data.frame(x=c(5, 15), y=c(4, 10))
two_players
##    x  y
## 1  5  4
## 2 15 10
# Plot the positions of the players
ggplot(two_players, aes(x = x, y = y)) + 
  geom_point() +
  # Assuming a 40x60 field
  lims(x = c(-30,30), y = c(-20, 20))

# Split the players data frame into two observations
player1 <- two_players[1, ]
player2 <- two_players[2, ]

# Calculate and print their distance using the Euclidean Distance formula
player_distance <- sqrt( (player1$x - player2$x)^2 + (player1$y - player2$y)^2 )
player_distance
## [1] 11.6619
# Calculate the Distance Between two_players
dist_two_players <- dist(two_players, method="euclidean")
dist_two_players
##         1
## 2 11.6619
# Create three_players data
three_players <- two_players %>% rbind(c(0, 20))
three_players
##    x  y
## 1  5  4
## 2 15 10
## 3  0 20
# Calculate the Distance Between three_players
dist_three_players <- dist(three_players, method="euclidean")
dist_three_players
##          1        2
## 2 11.66190         
## 3 16.76305 18.02776
# You have learned that when a variable is on a larger scale than other variables in your data it may disproportionately influence the resulting distance calculated between your observations
# Lets see this in action by observing a sample of data from the trees data set

# You will leverage the scale() function which by default centers & scales our column features.  Our variables are the following:
# Girth - tree diameter in inches
# Height - tree height in inches

# Create three_trees data
three_trees <- data.frame(Girth=c(8.3, 8.6, 10.5), Height=c(840, 780, 864))
three_trees
##   Girth Height
## 1   8.3    840
## 2   8.6    780
## 3  10.5    864
# Calculate distance for three_trees 
dist_trees <- dist(three_trees)

# Scale three trees & calculate the distance  
scaled_three_trees <-scale(three_trees)
dist_scaled_trees <- dist(scaled_three_trees)

# Output the results of both Matrices
print('Without Scaling')
## [1] "Without Scaling"
dist_trees
##          1        2
## 2 60.00075         
## 3 24.10062 84.02149
print('With Scaling')
## [1] "With Scaling"
dist_scaled_trees
##          1        2
## 2 1.409365         
## 3 1.925659 2.511082
# In this exercise you will explore how to calculate binary (Jaccard) distances
# In order to calculate distances will first have to dummify our categories using the dummy.data.frame() from the library dummies

# You will use a small collection of survey observations stored in the data frame job_survey with the following columns:
# job_satisfaction Possible options: "Hi", "Mid", "Low"
# is_happy Possible options: "Yes", "No"

# Create job_survey data
job_survey <- data.frame(job_satisfaction=factor(c("Low", "Low", "Hi", "Low", "Mid")), 
                         is_happy=factor(c("No", "No", "Yes", "No", "No"))
                         )
job_survey
##   job_satisfaction is_happy
## 1              Low       No
## 2              Low       No
## 3               Hi      Yes
## 4              Low       No
## 5              Mid       No
# Dummify the Survey Data
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
dummy_survey <- dummy.data.frame(job_survey)

# Calculate the Distance
dist_survey <- dist(dummy_survey, method="binary")

# Print the Orignal Data
job_survey
##   job_satisfaction is_happy
## 1              Low       No
## 2              Low       No
## 3               Hi      Yes
## 4              Low       No
## 5              Mid       No
# Print the Distance Matrix
dist_survey
##           1         2         3         4
## 2 0.0000000                              
## 3 1.0000000 1.0000000                    
## 4 0.0000000 0.0000000 1.0000000          
## 5 0.6666667 0.6666667 1.0000000 0.6666667

Chapter 2 - Hierarchical Clustering